Breathing KMeans: A Better and
Faster Alternative to KMeans


The performance of KMeans depends heavily on the centroid
initialization step. A poor initialization can easily produce
inaccurate clusters.


While KMeans++ offers smarter centroid initialization, it does not 
always guarantee accurate convergence (read how KMeans++ works in 
my previous post). This is especially true when the number of 
clusters is high. Here, repeating the algorithm may help. But it 
introduces an unnecessary overhead in run-time. 


Instead, Breathing KMeans is a better alternative here. Here’s how it 
works: 


• Step 1: Initialise k centroids and run KMeans without repeating.
In other words, don't re-run it with different initializations. Just
run it once.

• Step 2 (breathe-in step): Add m new centroids and run KMeans
with (k+m) centroids without repeating.

• Step 3 (breathe-out step): Remove m centroids from the existing
(k+m) centroids. Run KMeans with the remaining k centroids
without repeating.

• Step 4: Decrease m by 1.

• Step 5: Repeat Steps 2 to 4 until m=0.


The breathe-in step inserts new centroids close to the centroids with
the largest errors. A centroid's error is the sum of the squared
distances of the points assigned to that centroid.
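To make the error definition concrete, here is a minimal pure-Python sketch. The toy points, the centroids, and the helper names (squared_distance, centroid_errors) are my own illustrative assumptions; this is not how the bkmeans library computes it internally.

```python
# Per-centroid error: sum of squared distances from each point to the
# centroid it is assigned to (its nearest one). Toy data, illustrative only.

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid_errors(points, centroids):
    errors = [0.0] * len(centroids)
    for p in points:
        dists = [squared_distance(p, c) for c in centroids]
        nearest = dists.index(min(dists))   # assignment step
        errors[nearest] += dists[nearest]   # accumulate that centroid's error
    return errors

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (9.0, 5.0)]
centroids = [(0.0, 0.5), (7.0, 5.0)]

errors = centroid_errors(points, centroids)
worst = errors.index(max(errors))  # a breathe-in step would add a centroid near this one
```

Here the second centroid covers two far-apart points, so it carries the larger error and is the natural place to insert a new centroid.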


The breathe-out step removes centroids with low utility. A centroid's
utility is proportional to its distance from other centroids. The intuition
is that if two centroids are pretty close, they likely fall in the
same cluster. Thus, both will be assigned a low utility value, as
demonstrated below.


[Figure: Low-utility centroids. (a) Two neighboring centroids with low
utility values (red). (b) Removing one of them makes the other one very
useful (red).]


With these repeated breathing cycles, Breathing KMeans provides a 
faster and better solution than KMeans. In each cycle, new centroids 
are added at “good” locations, and centroids with low utility are 
removed. 
In the figure below, KMeans++ produced two misplaced centroids.

[Figure: KMeans++ (two misplaced centroids) vs. Breathing KMeans, with
the average convergence time of each.]


However, Breathing KMeans accurately clustered the data, with 
a 50% improvement in run-time. 


You can use Breathing KMeans by installing its open-source 
library, bkmeans, as follows: 


pip install bkmeans 


Next, import the library and run the clustering algorithm: 


import numpy as np
from bkmeans import BKMeans

X = np.random.rand(1000, 2)

bkm = BKMeans(n_clusters=100)  # the number of clusters here is illustrative

bkm.fit(X)


In fact, the BKMeans class inherits from the KMeans class of sklearn.
So you can specify other parameters and use any of the other methods
on the BKMeans object as needed.


More details about Breathing KMeans: GitHub | Paper. 
How Many Dimensions Should You
Reduce Your Data To When Using
PCA?

[Figure: Cumulative Explained Variance Plot for PCA]

When using PCA, it can be difficult to determine the number of
components to keep. Yet, here's a plot that can immensely help.


Note: If you don’t know how PCA works, feel free to read my detailed 
post: A Visual Guide to PCA. 


Still, here’s a quick step-by-step refresher. Feel free to skip this part if 
you remember my PCA post. 
[Figure: PCA steps. Determine a system of uncorrelated axes (x`, y`) to
represent the data; find the variance of the data along all uncorrelated
axes; discard directions of low variance (y`) and project the data along
directions of high variance (x`); final data with reduced dimensions.]


Step 1. Take a high-dimensional dataset ((x, y) in the above
figure) and represent it with uncorrelated axes ((x`, y`) in the above
figure). Why uncorrelated?

This is to ensure that data has zero correlation along its dimensions
and each new dimension represents its individual variance.

For instance, as data represented along (x, y) is correlated, the
variance along x is influenced by the spread of data along y.

Instead, if we represent data along (x`, y`), the variance along x` is
not influenced by the spread of data along y`.

The above space is determined using the eigenvectors of the data's
covariance matrix.

Step 2. Find the variance along all uncorrelated axes (x`, y`). The
eigenvalue corresponding to each eigenvector denotes the variance.


Step 3. Discard the axes with low variance. How many dimensions 
to discard (or keep) is a hyperparameter, which we will discuss below. 
Project the data along the retained axes. 
When reducing dimensions, the purpose is to retain enough variance of
the original data.

As each principal component explains some amount of variance,
cumulatively plotting the component-wise variance shows how many
components are needed to retain a given share of the total variance.

This is called a cumulative explained variance plot.

[Figure: Cumulative Explained Variance Plot for PCA]

For instance, say we intend to retain ~85% of the data variance. The
above plot clearly depicts that reducing the data to four components
will do that.


Also, as expected, all ten components together represent 100% 
variance of the data. 
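The reading of the plot can also be done numerically. Below is a small sketch in plain Python; the eigenvalues are made-up numbers for illustration (not the ones behind the plot above), and components_needed is a hypothetical helper name.

```python
from itertools import accumulate

# Hypothetical eigenvalues (variance along each principal axis), sorted
# in decreasing order. These numbers are made up for illustration.
eigenvalues = [4.0, 2.5, 1.5, 1.0, 0.5, 0.2, 0.1, 0.1, 0.05, 0.05]

total = sum(eigenvalues)
# Cumulative explained variance ratio after 1, 2, ..., n components.
cumulative = [running / total for running in accumulate(eigenvalues)]

def components_needed(cumulative_ratio, target):
    """Smallest number of components whose cumulative ratio reaches target."""
    for k, ratio in enumerate(cumulative_ratio, start=1):
        if ratio >= target:
            return k
    return len(cumulative_ratio)

k = components_needed(cumulative, 0.85)
```

Plotting cumulative against the component index gives the cumulative explained variance plot; k is the point where the curve first crosses the chosen threshold.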


Creating this plot is pretty simple in Python. Find the code 
here: PCA-CEV Plot. 
[Figure: A Mito spreadsheet inside a Jupyter notebook, opened with
"import mitosheet" and mitosheet.sheet(analysis_to_replay="id-utbdzhmhvd").
It shows employee_dataset (100 rows, 6 cols) with columns such as Name,
Company, City, Salary, and Status.]


Personally, I am a big fan of no-code data analysis tools. They are 
extremely useful in eliminating repetitive code across projects — 
thereby boosting productivity. 


Yet, most no-code tools are limited in terms of the functionality
they support. Thus, flexibility is usually a big challenge while using
them.


Mito is an incredible open-source tool that allows you to analyze your 
data within a spreadsheet interface in Jupyter without writing any 
code. 


What’s more, Mito recently supercharged its spreadsheet interface with 
AI. As a result, you can now analyze data in a notebook with text 
prompts. 


One of the coolest things about using Mito is that each edit in the
spreadsheet automatically generates equivalent Python code. This
makes it convenient to reproduce the analysis later.
Automatic code generation

from mitosheet.public.v3 import *; register_analysis("id-utbdzhmhva");
import pandas as pd

# Imported employee_dataset.csv
employee_dataset = pd.read_csv(r'employee_dataset.csv')

# group on city and find avg salary and rating
df2 = employee_dataset.groupby('City').agg({'Salary': 'mean', 'Rating': 'mean'})

# top 5 employees with highest salary
top_employees = employee_dataset.nlargest(5, 'Salary')


You can install Mito using pip as follows: 


python -m pip install mitosheet 


Next, to activate it in Jupyter, run the following two commands: 


python -m jupyter nbextension install --py --user mitosheet 
python -m jupyter nbextension enable --py --user mitosheet 
Be Cautious Before Drawing Any
Conclusions Using Summary Statistics

[Figure: Datasets with ZERO correlation. Panels include Sinusoid,
Concentric Circles, Bell Curve, Step function, and Absolute value.]


While analyzing data, one may be tempted to draw conclusions solely 
based on its statistics. Yet, the actual data might be conveying a totally 
different story. 


Here's a visual depicting nine datasets with approx. zero correlation 
between the two variables. But the summary statistic (Pearson 
correlation in this case) gives no clue about what's inside the data. 


What's more, data statistics could be heavily driven by outliers or other 
artifacts. I covered this in a previous post here. 
Thus, the importance of looking at the data cannot be stressed enough. 


It saves you from drawing wrong conclusions, which you could have 
made otherwise by looking at the statistics alone. 


For instance, in the sinusoidal dataset above, Pearson correlation may
make you believe that there is no association between the two
variables. However, remember that it is only quantifying the extent of
a linear relationship between them. Read more about this in another
one of my previous posts here.

Thus, if there's any other non-linear relationship (quadratic, sinusoid,
exponential, etc.), it may fail to capture that.
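Here is a quick sketch of this effect in plain Python. The data is synthetic and pearson is a hand-rolled helper written just for this illustration: y is fully determined by x, yet the correlation comes out as (numerically) zero.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [x / 10 for x in range(-50, 51)]   # symmetric around zero
quadratic = [x ** 2 for x in xs]        # perfectly dependent on x, but not linear
linear = [2 * x + 1 for x in xs]        # perfectly linear in x

r_quadratic = pearson(xs, quadratic)    # approximately 0
r_linear = pearson(xs, linear)          # approximately 1
```

A scatter plot of xs against quadratic would reveal the parabola immediately, which is exactly why looking at the data matters.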
Use Custom Python Objects In A
Boolean Context

without_bool.py

class Cart:
    def __init__(self):
        self.items = []

    # No __bool__ method

my_cart = Cart()

if my_cart:
    print("Cart Not Empty")
else:
    print("Cart Empty")

"Cart Not Empty" # Output

with_bool.py

class Cart:
    def __init__(self):
        self.items = []

    def __bool__(self):
        return bool(len(self.items))

my_cart = Cart()

if my_cart:
    print("Cart Not Empty")
else:
    print("Cart Empty")

"Cart Empty" # Output


In a boolean context, Python evaluates the objects of a custom
class to True by default. But this may not be desired in all cases.
Here's how you can override this behavior.

The __bool__ dunder method is used to define the behavior of an
object when used in a boolean context. As a result, you can specify
explicit conditions to determine the truthiness of an object.

This allows you to use class objects in a more flexible and intuitive
way.

As demonstrated above, without the __bool__ method
(without_bool.py), the object evaluates to True. But implementing the
__bool__ method lets us override this default behavior (with_bool.py).


Some additional good-to-know details 


When we use ANY object (be it instantiated from a custom or a built-in
class) in a boolean context, here's what Python does:

[Figure: Flowchart. Object in boolean context: does its class implement
the __bool__ method? If yes, invoke the __bool__ method. If not, does its
class implement the __len__ method? If yes, invoke the __len__ method.
Otherwise, return True.]

First, Python checks for the __bool__ method in its class
implementation. If found, it is invoked. If not, Python checks for the
__len__ method. If found, __len__ is invoked, and the object is
considered True when the returned length is non-zero. Otherwise, Python
returns True.


This explains the default behavior of objects instantiated from a 
custom class. As the Cart class implemented neither the __bool__ 
method nor the __len__ method, the cart object was evaluated to True. 
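The __len__ fallback is easy to check yourself. In this small sketch, CartWithLen and PlainCart are hypothetical classes I made up to isolate each branch of the lookup described above.

```python
class CartWithLen:
    """Hypothetical cart that defines __len__ but not __bool__."""

    def __init__(self):
        self.items = []

    def __len__(self):
        return len(self.items)


class PlainCart:
    """Hypothetical cart that defines neither __bool__ nor __len__."""


cart = CartWithLen()
empty_is_truthy = bool(cart)          # falls back to __len__, which returns 0

cart.items.append("book")
filled_is_truthy = bool(cart)         # __len__ now returns 1

plain_is_truthy = bool(PlainCart())   # neither method defined: default applies
```

So defining __len__ alone is often enough for container-like classes; __bool__ is only needed when truthiness should depend on something other than the length.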
A Visual Guide To Sampling
Techniques in Machine Learning

[Figure: Overview of four sampling techniques. Simple Random Sampling
(every data point has equal probability); Cluster Sampling, Single Stage
(whole clusters are selected); Cluster Sampling, Two Stage (1. select
clusters, 2. select data points); Stratified Sampling (1. create strata,
2. draw samples from each stratum).]


When you are dealing with large amounts of data, it is often preferred
to draw a relatively smaller sample and train a model. But any
mistakes in sampling can adversely affect the accuracy of your model.

[Figure: A bad/unrepresentative sample leads to a bad model; a
representative sample leads to a good model.]

This makes sampling a critical aspect of training ML models.
Here are a few popularly used techniques that one should know about:


• Simple random sampling: Every data point has an equal
probability of being selected in the sample.

[Figure: Simple Random Sampling. Every data point has equal probability.]

• Cluster sampling (single-stage): Divide the data into clusters and
select a few entire clusters.

[Figure: Cluster Sampling (Single Stage). Whole clusters are selected.]

• Cluster sampling (two-stage): Divide the data into clusters, select
a few clusters, and choose points from them randomly.

[Figure: Cluster Sampling (Two Stage). 1. Select clusters. 2. Select
data points.]

• Stratified sampling: Divide the data points into homogeneous
groups (based on age, gender, etc.), and select points randomly from
each group.

[Figure: Stratified Sampling. 1. Create strata. 2. Draw samples from
each stratum.]
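The techniques listed above can be sketched with Python's standard library alone. The dataset, the group labels, the 10% fraction, and the cluster size below are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical dataset: 100 records, 80 in group "A" and 20 in group "B".
data = [{"id": i, "group": "A" if i < 80 else "B"} for i in range(100)]

# Simple random sampling: every record is equally likely to be picked.
simple_sample = rng.sample(data, 10)

# Stratified sampling: split into groups (strata), then sample each
# stratum in proportion to its size.
strata = defaultdict(list)
for row in data:
    strata[row["group"]].append(row)

fraction = 0.1
stratified_sample = []
for rows in strata.values():
    stratified_sample.extend(rng.sample(rows, round(len(rows) * fraction)))

# Single-stage cluster sampling: split into clusters, keep whole clusters.
clusters = [data[i:i + 10] for i in range(0, len(data), 10)]
chosen_clusters = rng.sample(clusters, 2)
cluster_sample = [row for cluster in chosen_clusters for row in cluster]

# Two-stage cluster sampling: pick clusters, then pick points inside them.
two_stage_sample = [row for cluster in chosen_clusters
                    for row in rng.sample(cluster, 3)]
```

Note how the stratified sample preserves the 80/20 group ratio by construction, which simple random sampling only does on average.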


What are some other sampling techniques that you commonly resort to? 
You Were Probably Given Incomplete
Info About A Tuple's Immutability

my_tuple = (1, [2, 3])

my_tuple
(1, [2, 3])

my_tuple[1].append(4) # No Error

my_tuple
(1, [2, 3, 4]) # Tuple modified


When we say tuples are immutable, many Python programmers think 
that the values inside a tuple cannot change. But this is not true. 


The immutability of a tuple is solely restricted to the identity of 
objects it holds, not their value. 


In other words, say a tuple has two objects with IDs 1 and 2. 
Immutability says that the collection of IDs referenced by the tuple 
(and their order) can never change. 


Yet, there is NO such restriction that the individual objects with 
IDs 1 and 2 cannot be modified. 


Thus, if the elements inside the tuple are mutable objects, you can 
indeed modify them. 


And as long as the collection of IDs remains the same, the 
immutability of a tuple is not violated. 
This explains the demonstration above. As append is an in-place
operation, the collection of IDs didn't change. Thus, Python didn't raise
an error.

We can also verify this by printing the collection of object IDs
referenced inside the tuple before and after the append operation:

my_tuple = (1, [2, 3])

id(my_tuple[0]), id(my_tuple[1]) # IDs before append

my_tuple[1].append(4)

id(my_tuple[0]), id(my_tuple[1]) # same IDs after append

As shown above, the IDs pre and post append are the same. Thus,
immutability isn't violated.
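For completeness, here is a small self-checking sketch of both sides of the rule: mutating an element in place is fine, while rebinding a slot of the tuple (which would change the collection of IDs) raises a TypeError. The variable names are mine.

```python
my_tuple = (1, [2, 3])
ids_before = tuple(id(item) for item in my_tuple)

my_tuple[1].append(4)               # mutates the list in place
ids_after = tuple(id(item) for item in my_tuple)

try:
    my_tuple[1] = [2, 3, 4]         # rebinding a slot would change the IDs
    rebinding_allowed = True
except TypeError:
    rebinding_allowed = False       # tuples do not support item assignment
```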
A Simple Trick That Significantly
Improves The Quality of Matplotlib
Plots

[Figure: The same bar plot ("Benford's Law - First Digit Distribution",
x-axis: First Digit, y-axis: Frequency) rendered twice: a blurry plot
with the default format, and a superior-quality plot after changing the
format to SVG.]


Matplotlib plots often appear dull and blurry, especially when scaled or 
zoomed. Yet, here's a simple trick to significantly improve their 
quality. 


In Jupyter, Matplotlib plots are rendered as a raster image (PNG) by
default. Thus, any scaling/zooming drastically distorts their quality.

Instead, always render your plot as a scalable vector graphic (SVG). As
the name suggests, they can be scaled without compromising the plot's
quality.


As demonstrated in the image above, the plot rendered as SVG clearly 
outshines and is noticeably sharper than the default plot. 
The following code lets you change the render format to SVG. If the
difference is not apparent in the image above, I would recommend
trying it yourself and noticing the difference.

from matplotlib_inline.backend_inline import set_matplotlib_formats

set_matplotlib_formats('svg')

Alternatively, you can also use the following code:

%config InlineBackend.figure_format = 'svg'


P.S. If there’s a chance that you don’t know what is being depicted in 
the bar plot above, check out this YouTube video by Numberphile. 
A Visual and Overly Simplified Guide
to PCA

[Figure: PCA overview. High-dimensional data; determine a system of
uncorrelated axes (x`, y`) to represent the data; find the variance of
the data along all uncorrelated axes; discard directions of low variance
and project the data along directions of high variance (x`).]


Many folks often struggle to understand the core essence of principal 
component analysis (PCA), which is widely used for dimensionality 
reduction. Here's a simplified visual guide depicting what goes under 
the hood. 


In a nutshell, while reducing the dimensions, the aim is to retain as
much variation in data as possible.


To begin with, as the data may have correlated features, the first step is 
to determine a new coordinate system with orthogonal axes. This is a 
space where all dimensions are uncorrelated. 
[Figure: Determine a system of uncorrelated axes (x`, y`) to represent
the data.]

The above space is determined using the eigenvectors of the data's
covariance matrix.

Next, we find the variance of our data along these uncorrelated axes.
The variance is represented by the corresponding eigenvalues.

[Figure: Find the variance of the data along all uncorrelated axes,
marking the direction of high variance and the direction of low
variance.]


Next, we decide the number of dimensions we want our data to have 
post-reduction (a hyperparameter), say two. As our aim is to retain as 
much variance as possible, we select two eigenvectors with the highest 
eigenvalues. 


Why highest, you may ask? As mentioned above, the variance along an 
eigenvector is represented by its eigenvalue. Thus, selecting the top 
two eigenvalues ensures we retain the maximum variance of the overall 
data. 


Lastly, the data is transformed using a simple matrix multiplication 
with the top two vectors, as shown below: 
[Figure: Matrix multiplication of the data with the eigenvectors (sorted
by decreasing eigenvalues), taking it from init_dimensions to the
dimensionally reduced data.]


After reducing the dimension of the 2D dataset used above, we get the 
following. 


[Figure: Discard directions of low variance (y`) and project the data
along directions of high variance (x`): final data with reduced
dimensions.]


This is how PCA works. I hope this algorithm will never feel daunting 
again :) 
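If you prefer code to pictures, the steps in this post can be sketched with NumPy as follows. This is a minimal illustration on synthetic 2D data that I generated for the purpose (the coefficients and the choice k = 1 are assumptions); in practice you would likely reach for a library implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 2D data: y largely follows x.
x = rng.normal(size=500)
y = 0.8 * x + 0.2 * rng.normal(size=500)
X = np.column_stack([x, y])

# Center the data and compute its covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigenvectors give the uncorrelated axes; eigenvalues give the variance
# along each of them. eigh returns them in ascending order, so re-sort.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the top-k eigenvectors and project with a matrix multiplication.
k = 1
X_reduced = Xc @ eigvecs[:, :k]
```

The variance of the single retained column equals the largest eigenvalue, and projecting onto all eigenvectors would give columns with zero covariance, which is the "uncorrelated axes" claim in numbers.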
Supercharge Your Jupyter Kernel
With ipyflow


This is a pretty cool Jupyter hack I learned recently. 


While using Jupyter, you must have noticed that when you update a
variable, all its dependent cells have to be manually re-executed.
This is tedious and can get time-consuming if the sequence of
dependent cells is long.

Also, at times, isn't it difficult to determine the exact sequence of cell
executions that generated an output?


To resolve this, try ipyflow. It is a supercharged kernel for Jupyter, 
which tracks the relationship between cells and variables. 


In [1]: import numpy as np

Automatic Execution of Dependent Cells

In [2]: %flow mode reactive

In [ ]: x = 10 ## Updating x automatically executes its dependents
In [ ]: y = np.sin(x) ## Dependent on x
        z = np.cos(x) ## Dependent on x
In [ ]: output = y**2 + z**2 ## Dependent on y and z
        output

Export Code

In [ ]: from ipyflow import code

        print(code(output))


With ipyflow, at any point, you can obtain the corresponding code to
reconstruct any symbol.

What's more, its magic command enables an automatic recursive
re-execution of dependent cells if a variable is updated.

As shown in the demo above, updating the variable x automatically
triggers its dependent cells.


Do note that ipyflow offers a different kernel from the default kernel 
in Jupyter. Thus, once you install ipyflow, select the following kernel 
while launching a new notebook: 
[Figure: The Jupyter "New" menu, listing "Python 3 (ipyflow)" alongside
the default "Python 3 (ipykernel)" under Notebook.]


Find more details here: ipyflow. 
A Lesser-known Feature of Creating
Plots with Plotly


Plotly is pretty diverse when it comes to creating different types of 
charts. While many folks prefer it for interactivity, you can also use it 
to create animated plots. 


Here's an animated visualization depicting the time taken by light to 
reach different planets after leaving the Sun. 


[Figure: Speed of Light Visualization. x-axis: Distance (in Million KMs).]


Several functions in Plotly support animations using 
the animation_frame and animation_group parameters. 


The core idea behind creating an animated plot relies on plotting the 
data one frame at a time. 


For instance, suppose we have organized the data frame-by-frame, as
shown below:

[Figure: A table with columns planets, x_position, y_position, and
frame_id, holding each planet's location in each frame (1st frame, 2nd
frame, and so on).]


Now, if we invoke the scatter method with the 
animation_frame argument, it will plot the data frame-by-frame, 
giving rise to an animation. 


import plotly.express as px

px.scatter(df,
           x="x_position",
           y="y_position",
           color="planets",  # keyword assumed; it was lost in extraction
           animation_frame="frame_id")


In the above function call, the data corresponding to frame_id=0 will
be plotted first. This will be replaced by the data with frame_id=1 in
the next frame, and so on.


Find the code for this post here: GitHub. 
The Limitation Of Euclidean Distance
Which Many Often Ignore

[Figure: When x and y are correlated, two points at equal Euclidean
distance from the center are at unequal Mahalanobis distances.]


Euclidean distance is a commonly used distance metric. Yet, its
limitations make it unsuitable for many datasets.

Euclidean distance assumes that the axes are independent and that the
data is roughly spherically distributed. But when the dimensions are
correlated, Euclidean distance may produce misleading results.


Mahalanobis distance is an excellent alternative in such cases. It is a 
multivariate distance metric that takes into account the data 
distribution. 


As a result, it can measure how far away a data point is from the 
distribution, which Euclidean cannot. 
As shown in the image above, Euclidean considers pink and green
points equidistant from the central point. But Mahalanobis distance
considers the green point to be closer, which is indeed true, taking into
account the data distribution.


Mahalanobis distance is commonly used in outlier detection tasks. As 
shown below, while Euclidean forms a circular boundary for outliers, 
Mahalanobis, instead, considers the distribution — producing a more 
practical boundary. 
[Figure: Outlier boundaries around the center: circular for Euclidean,
following the shape of the distribution for Mahalanobis.]


Essentially, Mahalanobis distance allows the data to construct a 
coordinate system for itself, in which the axes are independent and 
orthogonal. 


Computationally, it works as follows: 
• Step 1: Transform the columns into uncorrelated variables.

• Step 2: Scale the new variables to make their variance equal
to 1.

• Step 3: Find the Euclidean distance in this new coordinate
system, where the data has a unit variance.


So eventually, we do reach Euclidean. However, to use Euclidean, we 
first transform the data to ensure it obeys the assumptions. 


Mathematically, it is calculated as follows: 


$$D^2 = (x - \mu) \, C^{-1} \, (x - \mu)^T$$

• x: rows of your dataset (Shape: n_samples*n_dimensions).

• μ: mean of individual dimensions (Shape: 1*n_dimensions).

• C^-1: Inverse of the covariance matrix
(Shape: n_dimensions*n_dimensions).

• D^2: Square of the Mahalanobis distance
(Shape: n_samples*n_samples). The diagonal entries hold the squared
distance of each row from the mean.
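The formula above can be sketched directly with NumPy. The data here is synthetic (an assumption for illustration), and mahalanobis is a small helper written for this post's example rather than a library function; SciPy ships its own implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 2D data: y roughly follows x.
x = rng.normal(size=1000)
y = x + 0.3 * rng.normal(size=1000)
X = np.column_stack([x, y])

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(point):
    """Mahalanobis distance of a single point from the data's mean."""
    diff = point - mu
    return float(np.sqrt(diff @ cov_inv @ diff))

# Two points at the same Euclidean distance from the mean:
along = mu + np.array([1.0, 1.0])     # follows the correlation
across = mu + np.array([1.0, -1.0])   # goes against it

euclid_along = float(np.linalg.norm(along - mu))
euclid_across = float(np.linalg.norm(across - mu))
maha_along = mahalanobis(along)
maha_across = mahalanobis(across)
```

Euclidean distance treats both points identically, while Mahalanobis distance rates the point that goes against the correlation as much farther from the distribution.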


Find more info here: Scipy docs. 
Visualising The Impact Of
Regularisation Parameter

[Figure: Decision boundaries on several datasets for λ = 0.00, 0.10,
1.00, 10.00, and 100.00. Increasing the regularization parameter (λ)
gives simple decision boundaries.]


Regularization is commonly used to prevent overfitting. The above 
visual depicts the decision boundary obtained on various datasets by 
varying the regularization parameter. 


As shown, increasing the parameter results in a decision boundary with
less curvature. Similarly, decreasing the parameter produces a more
complicated decision boundary.


But have you ever wondered what goes on behind the scenes? Why 
does increasing the parameter force simpler decision boundaries? 


To understand that, consider the cost function equation below (this is
for regression, but the idea stays the same for classification).

It is clear that the cost increases linearly with the parameter λ.

Cost Function = Loss + L2 Weight Penalty

$$\text{Cost} = \underbrace{\sum_{i=1}^{M} \Big( y_i - \sum_{j=1}^{N} x_{ij} w_j \Big)^2}_{\text{Squared Error}} + \underbrace{\lambda \sum_{j=1}^{N} w_j^2}_{\text{L2 Regularization Term}}$$


Now, if the parameter is too high, the penalty becomes higher too. 
Thus, to minimize its impact on the overall cost function, the network 
is forced to approach weights that are closer to zero. 
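To see the linear dependence on the parameter, here is a tiny plain-Python sketch of a squared-error cost with an L2 penalty. The data, the weights, and the function name ridge_cost are assumptions made for illustration.

```python
def ridge_cost(X, y, weights, lam):
    """Squared error plus an L2 weight penalty scaled by lam."""
    squared_error = 0.0
    for row, target in zip(X, y):
        prediction = sum(x * w for x, w in zip(row, weights))
        squared_error += (target - prediction) ** 2
    penalty = lam * sum(w ** 2 for w in weights)
    return squared_error + penalty

# Toy data that the weights [1, 2] fit exactly, so the squared error is 0
# and the whole cost comes from the penalty term.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
y = [5.0, 4.0, 9.0]
weights = [1.0, 2.0]

costs = {lam: ridge_cost(X, y, weights, lam) for lam in (0.0, 1.0, 10.0)}
```

With the weights held fixed, the cost grows in proportion to the parameter, so an optimizer facing a large value can only bring the cost down by shrinking the weights.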


This becomes evident if we print the final weights for one of the 
models, say one at the bottom right (last dataset, last model). 


In [17]: clf.coefs_


Out[17]: array([[ 8.35476806e-06, -1.29066987e-05, 1.49535843e-05, 
-43964067e-06, 5.46943218e-06, 1.18557175e-05, 


.01037005e-05, 3.70503012e-06, 2.12142850e-06, 
.78452613e-06], 

-35980250e-05, 1.52132934e-05, 3.30938991e-06, 
.41538247e-07, 1.68626879e-05, 1.14315983e-05, 
.64292409e-07, -1.40798113e-06, 1.31551207e-05, 
.52379486e-05]]) 


Having smaller weights effectively nullifies many neurons, producing a
much simpler network. This prevents many complex transformations
that could have happened otherwise.
AutoProfiler: Automatically Profile
Your DataFrame As You Work

[Figure: A notebook cell (import pandas as pd; my_df =
pd.read_csv("file.csv")) next to the AutoProfiler panel, which
automatically profiles my_df (1,000 x 9): columns such as Name,
Company_Name, Employee_City, Employee_Country, and Employee_Salary, with
summary statistics (min, 25%, median, mean, 75%, max).]

Pandas AutoProfiler: Automatically profile Pandas DataFrames at each 
execution, without any code. 


AutoProfiler is an open-source dataframe analysis tool for Jupyter. It
reads your notebook and automatically profiles every dataframe in
memory as you change it.


In other words, if you modify an existing dataframe, AutoProfiler will 
automatically update its corresponding profiling. 


Also, if you create a new dataframe (say from an existing dataframe), 
AutoProfiler will automatically profile that as well, as shown below: 
[Figure: After running new_df = my_df.sample(100), the new DataFrame
new_df (100 x 9) is profiled automatically alongside my_df (1,000 x 9).]


Profiling info includes column distribution, summary stats, null stats, 
and many more. Moreover, you can also generate the corresponding 
code, with its export feature. 
[Figure: Exporting from the profile of the Name column adds code to a
notebook cell, e.g. my_df[my_df["Name"] == "Sarah Smith"].]


Find more info here: GitHub Repo. 
A Little Bit Of Extra Effort Can
Hugely Transform Your Storytelling
Skills

[Figure: Two versions of the bar chart "Profit Margin for top five spend
categories": a basic Matplotlib plot, and a Matplotlib plot with a little
bit of extra effort, carrying the subtitle "Consumers are willing to pay
higher prices for aesthetic and decorative items for their home."]
Matplotlib is pretty underrated when it comes to creating
professional-looking plots. Yet, it is totally capable of doing so.

For instance, consider the two plots above.


Yes, both were created using matplotlib. But a bit of formatting makes 
the second plot much more informative, appealing, and easy to follow. 


The title and subtitle significantly aid the story. Also, the footnote 
offers extra important information, which is nowhere to be seen in the 
basic plot. 


Lastly, the bold bar immediately draws the viewer's attention and 
conveys the category's importance. 


So what's the message here? 
To be a good data storyteller, ensure that your plot demands
minimal effort from the viewer. Thus, don't shy away from putting in
that extra effort. This is especially true for professional environments.

At times, it may also be good to ensure that your visualizations convey
the right story, even if they are viewed in your absence.
A Nasty Hidden Feature of Python
That Many Programmers Aren't
Aware Of

def add_subject(name, subject, subjects=[]):  # mutable default parameter
    subjects.append(subject)
    return {'name': name, 'subjects': subjects}

add_subject('Joe', 'Maths')
add_subject('Bob', 'Maths')
add_subject('Roy', 'Maths')

Output (appended to the same list):

{'name': 'Joe', 'subjects': ['Maths']}
{'name': 'Bob', 'subjects': ['Maths', 'Maths']}
{'name': 'Roy', 'subjects': ['Maths', 'Maths', 'Maths']}
Mutability in Python is possibly one of the most misunderstood and
overlooked concepts. The above image demonstrates an example that
many Python programmers (especially new ones) struggle to understand.


Can you figure it out? If not, let’s understand it. 


The default parameters of a function are evaluated right at the time the 
function is defined. In other words, they are not evaluated each time the 
function is called (like in C++). 


Thus, as soon as a function is defined, the function object stores the
default parameters in its __defaults__ attribute. We can verify this
below:

[Code: __defaults__ inspected on a function defined as my_function(a=1, b=2 ... (the rest of the snippet is cut off in the source)]


Thus, if you specify a mutable default parameter in a function and 
mutate it, you unknowingly and unintentionally modify the parameter 
for all future calls to that function. 


This is shown in the demonstration below. Instead of creating a new
list at each function call, Python appends the element to the same list object.


def add_subject(...): ...

add_subject.__defaults__
([],)

add_subject('Joe', 'Maths')
add_subject.__defaults__
(['Maths'],)

add_subject('Bob', 'Maths')
add_subject.__defaults__
(['Maths', 'Maths'],)

add_subject('Roy', 'Maths')
add_subject.__defaults__
(['Maths', 'Maths', 'Maths'],)
So what can we do to avoid this?


Instead of specifying a mutable default parameter in a function's
definition, replace it with None. If the function does not receive a
corresponding value during the function call, create the mutable object
inside the function.


This is demonstrated below: 


Replace mutable parameter

def add_subject(name, subject, subjects=None):
    if subjects is None:
        # Create a new list
        subjects = []

    subjects.append(subject)
    return {'name': name, 'subjects': subjects}


add_subject('Joe', 'Maths')
add_subject('Bob', 'Maths')
add_subject('Roy', 'Maths')


Output:

{'name': 'Joe', 'subjects': ['Maths']}
{'name': 'Bob', 'subjects': ['Maths']}
{'name': 'Roy', 'subjects': ['Maths']}
As shown above, we create a new list if the function didn't receive any
value when it was called. This lets you avoid the unexpected behavior
of mutating the same object.
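
To see both behaviors side by side, here is a small self-contained sketch. The function names add_subject_shared and add_subject_safe are made up for this comparison; the bodies follow the two versions discussed above.

```python
def add_subject_shared(name, subject, subjects=[]):
    # The default list is created once, when the function is defined.
    subjects.append(subject)
    return {"name": name, "subjects": subjects}


def add_subject_safe(name, subject, subjects=None):
    # A fresh list is created on every call that omits `subjects`.
    if subjects is None:
        subjects = []
    subjects.append(subject)
    return {"name": name, "subjects": subjects}


defaults_before = add_subject_shared.__defaults__[0].copy()
add_subject_shared("Joe", "Maths")
bob_shared = add_subject_shared("Bob", "Maths")
defaults_after = add_subject_shared.__defaults__[0]

add_subject_safe("Joe", "Maths")
bob_safe = add_subject_safe("Bob", "Maths")

print(defaults_before, defaults_after)    # the stored default grew
print(bob_shared["subjects"])             # contains Joe's subject too
print(bob_safe["subjects"])               # contains only Bob's subject
print(add_subject_safe.__defaults__)      # the default stays None
```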
Interactively Visualise A Decision Tree
With A Sankey Diagram


[Figure: Sankey diagram of a decision tree trained on the iris dataset. The root node has Values: [50,50,50]. Each node shows its split condition (e.g., petal width (cm) <= 1.75), its per-class sample counts (e.g., Values: [0,49,5]), and its predicted class (e.g., Predict: versicolor).]


In one of my earlier posts, I explained why sklearn's decision trees
tend to overfit the data with their default parameters (read here if you
wish to recall).


To avoid this, it is always recommended to specify appropriate 
hyperparameter values. This includes the max depth of the tree, min 
samples in leaf nodes, etc. 


But determining these hyperparameter values is often done using trial- 
and-error, which can be a bit tedious and time-consuming. 


The Sankey diagram above allows you to interactively visualize the 
predictions of a decision tree at each node. 


Also, the number of data points from each class is size-encoded on all 
nodes, as shown below. 
[Figure: the node "petal width (cm) <= 1.75, Values: [0,49,5], Predict: versicolor", with the number of Class 1 samples and the number of Class 2 samples size-encoded on the node.]


This immediately gives an estimate of the impurity of the node. Based 
on this, you can visually decide to prune the tree. 


For instance, in the full decision tree shown below, pruning the tree at 
a depth of two appears to be reasonable. 


[Figure: the first two levels of the tree.
petal length (cm) <= 2.45: Values: [50,0,0], Predict: setosa.
petal length (cm) > 2.45: Values: [0,50,50], Predict: versicolor.
petal width (cm) <= 1.75: Values: [0,49,5], Predict: versicolor.
petal width (cm) > 1.75: Values: [0,1,45], Predict: virginica.]


Once you have obtained a rough estimate for these hyperparameter 
values, you can train a new decision tree. Next, measure its 
performance on new data to know if the decision tree is generalizing or 
not. 
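
The per-node class counts that the diagram encodes can also be computed directly. The sketch below is one possible way to do it with scikit-learn, not the code behind the Sankey diagram itself. The helper node_class_counts is a hypothetical name, and counting via decision_path is a choice made here so the result does not depend on how a given sklearn version stores node values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)


def node_class_counts(clf, X, y):
    """Count, for every node, how many samples of each class pass through it."""
    path = clf.decision_path(X)                       # sparse, (n_samples, n_nodes)
    one_hot = np.eye(len(clf.classes_), dtype=int)[y]  # (n_samples, n_classes)
    return np.asarray(path.T @ one_hot)               # (n_nodes, n_classes)


# Tree pruned at a depth of two, as suggested by the visual inspection.
pruned_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
counts = node_class_counts(pruned_tree, X, y)

for node_id, row in enumerate(counts):
    print(f"node {node_id}: values={row.tolist()}")
```

A node whose counts are spread over several classes is impure; a node dominated by one class is close to pure.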
Use Histograms With Caution. They
Are Highly Misleading!


[Figure: "Same Data, VERY Different Histograms": nine histograms of the same data with bin counts of 10, 15, 20, 25, 30, 35, 40, 45, and 50.]


Histograms are commonly used for data visualization. But, they can be 
misleading at times. Here's why. 


Histograms divide the data into small bins and represent the frequency 
of each bin. 


Thus, the choice of the number of bins you begin with can significantly 
impact its shape. 


The figure above depicts the histograms obtained on the same data, but 
by altering the number of bins. Each histogram conveys a different 
story, even though the underlying data is the same. 
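
A quick way to reproduce this effect is to bin one dataset with several bin counts. The sketch below uses synthetic bimodal data (an assumption; the original dataset is not available) and NumPy's histogram function.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: a mixture of two normal distributions (illustrative only).
data = np.concatenate([rng.normal(110, 10, 300), rng.normal(160, 15, 200)])

summaries = {}
for bins in (10, 25, 50):
    counts, edges = np.histogram(data, bins=bins)
    summaries[bins] = counts
    # The same 500 points produce a different bar profile each time.
    print(f"bins={bins:2d}  tallest bar={counts.max():3d}  "
          f"empty bins={(counts == 0).sum()}")
```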


This, at times, can be misleading and may lead you to draw the wrong 
conclusions. 


The takeaway is NOT that histograms should not be used. Instead, look 
at the underlying distribution too. Here, a violin plot and a KDE plot 
can help. 


Violin plot 
Similar to box plots, violin plots show the distribution of data
based on quartiles. However, they also add a kernel density estimate to
display the density of data at different values.


[Figure: Violin Plot]


This provides a more detailed view of the distribution, particularly in 
areas with higher density. 


KDE plot 


KDE plots use a smooth curve to represent the data distribution, 
without the need for binning, as shown below: 


[Figure: KDE Plot]


As a parting note, always remember that whenever you condense a
dataset, you run the risk of losing important information.


Thus, be mindful of any limitations (and assumptions) of the 
visualizations you use. Also, consider using multiple methods to ensure 
that you are seeing the whole picture. 
Three Simple Ways To (Instantly)
Make Your Scatter Plots Clutter Free


[Figure: a default scatter plot (cluttered and difficult to interpret) next to better alternatives: a KDE plot and a hexbin plot.]


Scatter plots are commonly used in data visualization tasks. But when 
you have many data points, they often get too dense to interpret. 


Here are a few techniques (and alternatives) you can use to make your 
data more interpretable in such cases. 


One of the simplest yet effective ways could be to reduce the marker 
size. This, at times, can instantly offer better clarity over the default 
plot. 
[Figure: default scatter plot (cluttered and difficult to interpret) vs. scatter plot with reduced marker size (clear and easy to interpret).]


Next, as an alternative to a scatter plot, you can use a density plot, 
which depicts the data distribution. This makes it easier to identify 
regions of high and low density, which may not be evident from a 
scatter plot. 


[Figure: default scatter plot (cluttered and difficult to interpret) vs. density plot.]


Lastly, another better alternative can be a hexbin plot. It bins the chart 
into hexagonal regions and assigns a color intensity based on the 
number of points in that area. 
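
As a rough sketch, two of these options (smaller markers and a hexbin plot) can be produced with matplotlib alone. The data is synthetic and the marker size, alpha, and gridsize values are arbitrary choices. For the density plot, a library such as seaborn (its kdeplot function) is one option, not shown here.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20_000)
y = 0.5 * x + rng.normal(size=20_000)

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 4))

ax1.scatter(x, y)                      # default: dense blob
ax1.set_title("Default Scatter Plot")

ax2.scatter(x, y, s=1, alpha=0.3)      # smaller, semi-transparent markers
ax2.set_title("Reduced Marker Size")

hb = ax3.hexbin(x, y, gridsize=40, cmap="viridis")  # color = points per hexagon
ax3.set_title("Hexbin Plot")
fig.colorbar(hb, ax=ax3, label="points per hexagon")
```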
[Figure: default scatter plot (cluttered and difficult to interpret) vs. hexbin plot (hexagonally grouped data).]
A (Highly) Important Point to
Consider Before You Use KMeans
Next Time


[Figure: a dataset, and two bar plots titled "50 Models of KMeans" and "50 Models of KMeans++". x-axis: Number of Misplaced Centroids; y-axis: Frequency.]


The most important yet often overlooked step of KMeans is its centroid 
initialization. Here's something to consider before you use it next time. 


KMeans selects the initial centroids randomly. As a result, it may
converge to a poor solution at times. This requires us to repeat
clustering several times with different initializations.


Yet, repeated clustering may not guarantee that you will soon end up 
with the correct clusters. This is especially true when you have many 
centroids to begin with. 


Instead, KMeans++ takes a smarter approach to initialize centroids. 


The first centroid is selected randomly. But each subsequent centroid is
chosen based on its distance from the centroids selected so far.

[Figure: Initial Centroid 2]


In other words, a point that is away from the first centroid is more 
likely to be selected as an initial centroid. This way, all the initial 
centroids are likely to lie in different clusters already, and the 
algorithm may converge faster and more accurately. 
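
The seeding idea can be sketched in a few lines of NumPy. This is a minimal illustration of the standard k-means++ rule (sample each new centroid with probability proportional to its squared distance from the nearest centroid chosen so far), not sklearn's implementation. The function name kmeans_pp_init and the toy data are assumptions.

```python
import numpy as np


def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids from the rows of X using k-means++ seeding."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    first = rng.integers(n)                      # first centroid: uniform at random
    centroids = [X[first]]
    # Squared distance of every point to its nearest chosen centroid.
    d2 = np.sum((X - X[first]) ** 2, axis=1)
    for _ in range(1, k):
        probs = d2 / d2.sum()                    # far-away points are more likely
        idx = rng.choice(n, p=probs)
        centroids.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centroids)


# Toy data: three well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
init = kmeans_pp_init(X, k=3)
print(init)
```

Because an already chosen point has zero distance to itself, it gets zero probability and cannot be picked twice.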


The impact is evident from the bar plots shown below. They depict the 
frequency of the number of misplaced centroids obtained (analyzed 
manually) after training 50 different models with KMeans and 
KMeans++. 


On the given dataset, out of the 50 models, KMeans only produced 
zero misplaced centroids once, which is a success rate of just 2%. 
[Figure: "50 Models of KMeans": KMeans converges correctly only once. "50 Models of KMeans++": KMeans++ converges correctly always. x-axis: Number of Misplaced Centroids; y-axis: Frequency.]


Luckily, if you are using sklearn, you don’t need to worry about the 
initialization step. This is because sklearn, by default, resorts to the 
KMeans++ approach. 


However, if you have a custom implementation, do give it a thought. 
Why You Should Avoid Appending
Rows To A DataFrame


[Figure: "DataFrame Size vs Row Append Time": the append run-time shows an uptrend as the total number of rows grows.]


As we append more and more rows to a Pandas DataFrame, the append 
run-time keeps increasing. Here's why. 


A DataFrame is a column-major data structure. Thus, consecutive 
elements in a column are stored next to each other in memory. 


[Figure: a DataFrame with columns col1 to col4; each whole column is stored in contiguous blocks of memory.]
As new rows are added, Pandas always wants to preserve its
column-major form.


But while adding new rows, there may not be enough space to 
accommodate them while also preserving the column-major structure. 


In such a case, existing data is moved to a new memory location, where 
Pandas finds a contiguous block of memory. 


Thus, as the size grows, memory reallocation gets more frequent, and 
the run time keeps increasing. 


The reason for spikes in this graph may be because a column taking 
higher memory was moved to a new location at this point, thereby 
taking more time to reallocate, or many columns were shifted at once. 


So what can we do to mitigate this? 


The increase in run-time solely arises because Pandas is trying to 
maintain its column-major structure. 


Thus, if you intend to grow a dataframe (row-wise) this frequently, it is 
better to first convert the dataframe to another data structure, a 
dictionary or a numpy array, for instance. 


Carry out the append operations here, and when you are done, convert 
it back to a dataframe. 
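
A minimal sketch of this pattern, assuming the rows arrive one at a time: collect them in a plain Python list of dictionaries and build the DataFrame once at the end. The column names and values are placeholders.

```python
import pandas as pd

rows = []
for i in range(1000):
    # Appending to a Python list is cheap; no DataFrame is touched here.
    rows.append({"id": i, "value": i * 0.5})

# Build the DataFrame once, after all rows have been collected.
df = pd.DataFrame(rows)
print(df.shape)
```

This also sidesteps the fact that DataFrame.append was removed in pandas 2.0.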


P.S. Adding new columns is not a problem. This is because this 
operation does not conflict with other columns. 
Matplotlib Has Numerous Hidden
Gems. Here's One of Them.


[Figure: default rendering (lines on top) vs. controlled rendering (dots on top), obtained by passing an ordering parameter to plt.scatter(X, y, ...).]


One of matplotlib's best yet most underrated strengths is its
customizability. Here's a pretty interesting thing you can do with it.


By default, matplotlib renders different types of elements (also called 
artists), like plots, legend, texts, etc., in a specific order. 


But this ordering may not be desirable in all cases, especially when 
there are overlapping elements in a plot, or the default rendering is 
hiding some crucial details. 


With the zorder parameter, you can control this rendering order. Artists
with a higher zorder value appear closer to the viewer and are drawn on
top of artists with lower zorder values.
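
Here is a small sketch of the idea with synthetic data. By default, matplotlib draws lines above scatter markers; passing explicit zorder values flips that. The specific values 1 and 2 are arbitrary, only their relative order matters.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 20)
y = np.sin(x)

fig, ax = plt.subplots()
ax.grid(True)

# The line gets the lower zorder, so the dots are rendered on top of it.
(line,) = ax.plot(x, y, linewidth=3, zorder=1)
dots = ax.scatter(x, y, s=80, color="tab:red", zorder=2)

print(line.get_zorder(), dots.get_zorder())
```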
Lastly, in the above demonstration, if we specify zorder=0 for the line
plot, we notice that it goes behind the grid lines.


[Figure: Line behind grid]


You can find more details about zorder here: Matplotlib docs. 
A Counterintuitive Thing About
Python Dictionaries


[Figure: 4 keys are added to a dictionary: 1.0 (float), 1 (int), True (bool), and '1' (string). The resulting dict only has 2 keys.]


Despite adding 4 distinct keys to a Python dictionary, can you tell why 
it only preserves two of them? 


Here’s why. 


In Python, dictionaries find a key based on its hash (computed using
hash()) and equality (checked using ==), but not identity (computed
using id()).


In this case, there’s no doubt that 1.0, 1, and True inherently have 
different datatypes and are also different objects. This is shown below: 
Yet, as they share the same hash value and compare equal, the
dictionary considers them to be the same key.

[Code: hash(1.0), hash(1), and hash(True) all return the same value]


But did you notice that in the demonstration, the final key is 1.0, while
the value corresponds to the key True?

[Figure: the resulting dictionary {1.0: 'One (bool)', '1': 'One (string)'} keeps the float key with the value of the boolean key.]
This is because, at first, 1.0 is added as a key and its value is
'One (float)'. Next, while adding the key 1, Python finds that it is
equivalent to the existing key 1.0 (same hash, equal value).


Thus, the value corresponding to 1.0 is overwritten by 'One (int)', 
while the key (1.0) is kept as is. 


Finally, while adding True, another hash equivalence is encountered
with an existing key of 1.0. Yet again, the value corresponding to 1.0,
which was updated to 'One (int)' in the previous step, is overwritten
by 'One (bool)'.


I am sure you may have already guessed why the string key ‘1’ is 
retained. 
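
The whole sequence can be replayed in a few lines. This sketch uses the same keys and values described above.

```python
d = {}
d[1.0] = "One (float)"    # new key 1.0
d[1] = "One (int)"        # equal to 1.0 with the same hash: value overwritten
d[True] = "One (bool)"    # equal to 1.0 with the same hash: value overwritten
d["1"] = "One (string)"   # a string never equals a number: new key

print(d)
print(hash(1.0) == hash(1) == hash(True))  # same hash
print(1.0 == 1 == True)                    # and equal
print(1.0 is 1)                            # but not the same object
```

The first key object inserted (the float) stays in place; only the value is replaced on later assignments.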




Probably The Fastest Way To Execute 
Your Python Code 


big_loop.py

result = []
for a in range(...):          # loop bounds are not legible in the source
    for b in range(...):
        if (a+b)%11 == 0:
            result.append((a,b))


Python

$ python big_loop.py
# Run-time: 10.9s


Codon

$ codon run big_loop.py


Many Python programmers are often frustrated with Python’s run-time. 
Here’s how you can make your code blazingly fast by changing just 
one line. 


Codon is an open-source, high-performance Python compiler. Unlike an
interpreter, it compiles your Python code to fast machine code.

Thus, post compilation, your code runs at native machine code speed.
As a result, speedups of the order of 50x or more are often reported.

According to the official docs, if you know Python, you already know
99% of Codon. There are only a few minor differences between the two,
which you can read here: Codon docs.
Find some more benchmarking results between Python and Codon
below:

fib.py: fib(N), a function to find the Nth Fibonacci number, using fib(N) = fib(N-1) + fib(N-2).
pi.py: pi_approx(n_terms), a function to find the approximate value of pi, using pi = 4*(1 - 1/3 + 1/5 - 1/7 ...).

Benchmark               Python (python file.py)   Codon (codon run file.py)   Speedup
fib.py, N=35            2.53s                     0.04s                       60x faster
fib.py, N=45            (not legible)             4.89s                       60x faster
pi.py, n_terms=10^8     14.7s                     0.35s                       40x faster
Are You Sure You Are Using The
Correct Pandas Terminologies?


Selecting.py (extracting col(s)):
    df.col
    df.loc[:, 'col-name']

Slicing.py (extracting row(s)):
    df.iloc[0]
    df.loc['row-name']

Indexing.py (Selecting + Slicing):
    df.iloc['row-idx', 'col-idx']
    df.loc['row-name', 'col-name']

Filtering.py (conditional subsetting):
    df[df.col>10]
    (a second example is not legible in the source)


Many Pandas users use the dataframe subsetting terminologies 
incorrectly. So let's spend a minute to get it straight. 


SUBSETTING means extracting value(s) from a dataframe. This can be 
done in four ways: 


1) We call it SELECTING when we extract one or more of its 
COLUMNS based on index location or name. The output contains some 
columns and all rows. 


2) We call it SLICING when we extract one or more of its ROWS based 
on index location or name. The output contains some rows and all 
columns. 




3) We call it INDEXING when we extract both ROWS and COLUMNS 
based on index location or name. 




4) We call it FILTERING when we extract ROWS and COLUMNS 
based on conditions. 
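
The four operations above can be tried on a tiny DataFrame. The frame, its labels, and the condition below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [5, 12, 20], "col2": [1.0, 2.0, 3.0], "col3": list("abc")},
    index=["r1", "r2", "r3"],
)

selected = df.loc[:, ["col1", "col3"]]             # SELECTING: some columns, all rows
sliced = df.iloc[0:2]                              # SLICING: some rows, all columns
indexed = df.loc[["r1", "r3"], ["col1", "col2"]]   # INDEXING: rows and columns
filtered = df[df.col1 > 10]                        # FILTERING: rows matching a condition

for name, part in [("selected", selected), ("sliced", sliced),
                   ("indexed", indexed), ("filtered", filtered)]:
    print(name, part.shape)
```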


Of course, there are many other ways you can perform these four
operations.


Here’s a comprehensive Pandas guide I prepared once: Pandas Map. 
Please refer to the “DF Subset” branch to read about various subsetting 
methods :) 
Is Class Imbalance Always A Big
Problem To Deal With?


[Figure: imbalance with low "class separability" vs. imbalance with high "class separability"; each panel shows a majority class and a minority class.]


Addressing class imbalance is often a challenge in ML. Yet, it may not 
always cause a problem. Here's why. 


One key factor in determining the impact of imbalance is class 
separability. 


As the name suggests, it measures the degree to which two or more 
classes can be distinguished or separated from each other based on 
their feature values. 


When classes are highly separable, there is little overlap between their 
feature distributions (as shown below). This makes it easier for a 
classifier to correctly identify the class of a new instance. 
[Figure: Feature Distribution]


Thus, if your data has a high degree of class separability, imbalance
may not be a problem per se.


To conclude, consider estimating the class separability before jumping 
to any sophisticated modeling steps. 


This can be done visually or by evaluating imbalance-specific metrics 
on simple models. 


The figure below depicts the decision boundary learned by a logistic 
regression model on the class-separable dataset. 
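
A rough way to run this check is to fit a simple model and look at an imbalance-aware metric such as minority-class recall. The sketch below uses synthetic, well-separated blobs with a 95:5 class ratio; the dataset, the ratio, and the choice of recall are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# 950 majority samples around (0, 0) and 50 minority samples around (8, 8).
X, y = make_blobs(
    n_samples=[950, 50],
    centers=np.array([[0.0, 0.0], [8.0, 8.0]]),
    cluster_std=1.0,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
minority_recall = recall_score(y_test, model.predict(X_test), pos_label=1)
print(f"minority-class recall: {minority_recall:.2f}")
```

If a plain model already recovers the minority class well, heavier imbalance-handling steps may not be needed.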


[Figure: Decision Boundary]



A Simple Trick That Will Make 
Heatmaps More Elegant 


[Figure: color-encoded heatmap]

Heatmaps often make data analysis much easier. Yet, they can be
further enriched with a simple modification.


A traditional heatmap represents the values using a color scale. Yet, 
mapping the cell color to numbers is still challenging. 


Embedding a size component can be extremely helpful in such cases. 
In essence, the bigger the size, the higher the absolute value. 


This is especially useful to make heatmaps cleaner, as many values 
nearer to zero will immediately shrink. 


In fact, you can represent the size with any other shape. Below, I 
created the same heatmap using a circle instead: 
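
One way to build such a plot is with a matplotlib scatter plot laid out on a grid, where each marker's color carries the sign and value and its area carries the absolute value. This is a sketch on a synthetic correlation matrix, not the code linked from this post; the scale factor of 800 is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 6))
data[:, 1] += data[:, 0]                 # make two features correlated
corr = np.corrcoef(data, rowvar=False)   # 6 x 6 correlation matrix

n = corr.shape[0]
xs, ys = np.meshgrid(np.arange(n), np.arange(n))
sizes = np.abs(corr).ravel() * 800       # bigger marker = larger absolute value

fig, ax = plt.subplots(figsize=(5, 5))
sc = ax.scatter(xs.ravel(), ys.ravel(), s=sizes, c=corr.ravel(),
                cmap="coolwarm", vmin=-1, vmax=1, marker="o")
ax.set_xticks(range(n))
ax.set_yticks(range(n))
ax.invert_yaxis()
ax.set_aspect("equal")
fig.colorbar(sc, ax=ax, label="correlation")
```

Swapping marker="o" for marker="s" gives squares instead of circles.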
[Figure: color-encoded heatmap vs. color + size encoded heatmap.]


Find the code for this post here: GitHub. 
A Visual Comparison Between
Locality and Density-based Clustering


[Figure: KMeans vs. DBSCAN clustering results on Dataset A, Dataset B, Dataset C, and Dataset D.]


The utility of KMeans is limited to datasets with spherical clusters. 
Thus, any variation is likely to produce incorrect clustering. 


Density-based clustering algorithms, such as DBSCAN, can be a better 
alternative in such cases. 


They cluster data points based on density, making them robust to 
datasets of varying shapes and sizes. 


The image depicts a comparison of KMeans vs. DBSCAN on multiple 
datasets. 


As shown, KMeans only works well when the dataset has spherical 
clusters. But in all other cases, it fails to produce correct clusters. 
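
The difference is easy to reproduce on a non-spherical dataset. The sketch below uses sklearn's two-moons generator and compares both algorithms against the known labels with the adjusted Rand index. The dataset and the DBSCAN settings (eps=0.2, min_samples=5) are choices made for this illustration and would need tuning on other data.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: clearly not spherical clusters.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"KMeans ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI: {ari_dbscan:.2f}")
```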


Find more here: Sklearn Guide. 
Why Don't We Call It Logistic
Classification Instead?


[Figure: "Logistic Regression" or "Logistic Classification"?]


Have you ever wondered why logistic regression is called "regression"
when we only use it for classification tasks? Why not call it "logistic
classification" instead? Here's why.


Most of us interpret logistic regression as a classification algorithm. 
However, it is a regression algorithm by nature. This is because it 
predicts a continuous outcome, which is the probability of a class. 


[Figure: Logistic Regression outputs the probability of a class.]


It is only when we apply a threshold and change the interpretation
of its output that the whole pipeline becomes a classifier.

[Figure: probability of class + threshold = classifier.]


Yet, intrinsically, it is never the algorithm performing the 
classification. The algorithm always adheres to regression. Instead, it 
is that extra step of applying probability thresholds that classifies a 
sample. 
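
The two stages can be separated explicitly in sklearn. In this sketch (synthetic data, 0.5 threshold as an assumption), predict_proba gives the continuous output and the thresholding step is done by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# Stage 1 (regression): a continuous probability for class 1.
proba = model.predict_proba(X)[:, 1]

# Stage 2 (classification): apply a threshold to that probability.
threshold = 0.5
labels = (proba >= threshold).astype(int)

print(proba[:5].round(3))
print(labels[:5])
```

Changing the threshold changes the predicted labels without retraining the model.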




A Typical Thing About Decision Trees 
Which Many Often Ignore 


[Figure: a dataset with noisy samples, and the resulting decision boundary with overfitted boundaries.]


Although decision trees are simple and intuitive, they always need a bit 
of extra caution. Here's what you should always remember while 
training them. 


In sklearn's implementation, by default, a decision tree is allowed to
grow until all leaves are pure. This leads to overfitting as the model
attempts to correctly classify every sample in the training set,
including the noisy ones.


There are various techniques to avoid this, such as pruning and 
ensembling. Also, make sure that you tune hyperparameters if you use 
sklearn's implementation. 
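
The effect of the defaults can be checked on noisy synthetic data. In this sketch, 10% of the labels are flipped on purpose, and the limits max_depth=3 and min_samples_leaf=10 are arbitrary example values, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.1 randomly reassigns about 10% of the labels (label noise).
X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

default_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
limited_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                                      random_state=0).fit(X_train, y_train)

for name, tree in [("default", default_tree), ("limited", limited_tree)]:
    print(f"{name}: depth={tree.get_depth()}, "
          f"train acc={tree.score(X_train, y_train):.2f}, "
          f"test acc={tree.score(X_test, y_test):.2f}")
```

A perfect training score paired with a clearly lower test score is the usual sign that the tree has memorized noise.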


This was a gentle reminder as many of us often tend to use sklearn’s 
implementations in their default configuration. 


It is always a good practice to know what a default implementation is 
hiding underneath. 
Always Validate Your Output Variable
Before Using Linear Regression


[Figure: top row: distribution of Y (skewed) and the regression fit on Y. After a log transform, bottom row: distribution of log(Y) and the regression fit on log(Y).]


The effectiveness of a linear regression model largely depends on how 
well our data satisfies the algorithm's underlying assumptions. 


Linear regression inherently assumes that the residuals (actual -
prediction) follow a normal distribution. One way this assumption may
get violated is when your output is skewed.


As a result, the model may produce a poor regression fit.


But the good thing is that it can be corrected. One common way to 
make the output symmetric before fitting a model is to apply a log 
transform. 


It removes the skewness by evenly spreading out the data, making it 
look somewhat normal. 


One thing to note is that a log transform is undefined for zero or
negative values. In such cases, one can first shift (translate) the
output so that all values are positive, and then apply the log.
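
The sketch below illustrates the idea on synthetic data where log(y) is linear in x by construction (an assumption made so the effect is visible). It fits a line to y and to log(y) with NumPy and compares the skewness of the residuals; the skewness helper is defined here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=2000)
# Right-skewed, strictly positive output: log(y) = 0.5 + 0.8*x + noise.
y = np.exp(0.5 + 0.8 * x + rng.normal(0, 0.5, size=2000))


def skewness(values):
    """Sample skewness: third central moment over std cubed."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


# Fit on the raw output.
slope_raw, intercept_raw = np.polyfit(x, y, deg=1)
resid_raw = y - (slope_raw * x + intercept_raw)

# Fit on the log-transformed output.
slope_log, intercept_log = np.polyfit(x, np.log(y), deg=1)
resid_log = np.log(y) - (slope_log * x + intercept_log)

print(f"residual skewness, raw fit: {skewness(resid_raw):.2f}")
print(f"residual skewness, log fit: {skewness(resid_log):.2f}")
```

For outputs with zero or negative values, a shift such as np.log(y - y.min() + 1) is one possible translation before the log.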
A Counterintuitive Fact About Python
Functions
# Define a function
>>> def my_func(): pass

# Add new attributes to function object
>>> my_func.my_attr = 'new_attribute'
>>> my_func.my_attr
'new_attribute'

# Pass as an argument to another function
>>> def new_func(f): pass
>>> new_func(my_func)

>>> my_func.__name__
'my_func'

>>> my_func.__dict__
{'my_attr': 'new_attribute'}


Everything in python is an object instantiated from some class. This 
also includes functions, but accepting this fact often feels 
counterintuitive at first. 


Here are a few ways to verify that python functions are indeed objects. 


The friction typically arises due to one's acquaintance with other 
programming languages like C++ and Java, which work very 
differently. 


However, in Python, everything you work with is an object. You are
always using object-oriented programming (OOP), probably without even
realizing it.
Why Is It Important To Shuffle Your
Dataset Before Training An ML Model


[Figure: training on unshuffled (label-ordered) data fails to converge, while training on shuffled data converges seamlessly.]


ML models may fail to converge for many reasons. Here's one of them 
which many folks often overlook. 


If your data is ordered by labels, this could negatively impact the 
model's convergence and accuracy. This is a mistake that can typically 
go unnoticed. 


In the above demonstration, I trained two neural nets on the same data. 
Both networks had the same initial weights, learning rate, and other 
settings. 


However, in one of them, the data was ordered by labels, while in 
another, it was randomly shuffled. 


As shown, the model receiving a label-ordered dataset fails to 
converge. However, shuffling the dataset allows the network to learn 
from a more representative data sample in each batch. This leads to 
better generalization and performance. 


In general, it's a good practice to shuffle the dataset before training.
This prevents the model from picking up label-specific patterns that do
not actually exist in the data.


In fact, it is also recommended to reshuffle the data in every epoch so
that the batches differ from one epoch to the next.
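
A minimal NumPy sketch of per-epoch shuffling is shown below. The generator name batches and the toy arrays are made up for this example; the key point is that features and labels are indexed with the same permutation so the pairs stay aligned.

```python
import numpy as np

X = np.arange(200, dtype=float).reshape(100, 2)   # row i holds [2i, 2i+1]
y = np.repeat([0, 1], 50)                         # labels ordered: 50 zeros, then 50 ones


def batches(X, y, batch_size, rng):
    """Yield mini-batches in a fresh random order on every call."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]                      # same indices for both arrays


rng = np.random.default_rng(0)
for epoch in range(3):
    for X_batch, y_batch in batches(X, y, batch_size=20, rng=rng):
        pass  # a training step would go here

# Inspect one batch: it mixes both labels instead of containing just one.
first_X, first_y = next(batches(X, y, 20, np.random.default_rng(0)))
print(first_y)
```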
The Limitations Of Heatmap That Are
Slowing Down Your Data Analysis


[Figure: a traditional heatmap vs. a clustered heatmap whose columns are reordered as Patient 6, Patient 4, Patient 3, Patient 2, Patient 1, Patient 5.]


Heatmaps often make data analysis much easier. Yet, they do have 
some limitations. 


A traditional heatmap does not group rows (and features). Instead, its 
orientation is the same as the input. This makes it difficult to visually 
determine the similarity between rows (and features). 


Clustered heatmaps can be a better choice in such cases. They cluster the
rows and features together to help you make better sense of the data.


They can be especially useful when dealing with large datasets, where
a traditional heatmap would be visually daunting to look at.


However, the groups in a clustered heatmap make it easier to visualize 
similarities and identify which rows (and features) go with one 
another. 


To create a clustered heatmap, you can use the sns.clustermap() 
method from Seaborn. More info here: Seaborn docs. 
The Limitation Of Pearson
Correlation Which Many Often
Ignore


[Figure: Linear Data vs. Non-linear Data]


Pearson correlation is commonly used to determine the association 
between two continuous variables. But many often ignore its 
assumption. 


Pearson correlation primarily measures the LINEAR relationship
between two variables. As a result, even if two variables have a
perfectly monotonic but non-linear relationship, the Pearson
coefficient will fall short of 1.


One great alternative is the Spearman correlation. It primarily assesses
the monotonicity between two variables, which may be linear or
non-linear.


What's more, Spearman correlation is also useful in situations when 
your data is ranked or ordinal. 
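
A quick check with SciPy makes the difference concrete. The exponential relationship below is an arbitrary example of a monotonic, non-linear dependence.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(0, 10, 100)
y = np.exp(x)   # strictly increasing in x, but far from linear

pearson_r = pearsonr(x, y)[0]
spearman_r = spearmanr(x, y)[0]

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

Spearman works on ranks, so any strictly increasing relationship scores as a perfect association.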
Why Are We Typically Advised To Set
Seeds for Random Generators?


[Figure: the same input data is transformed differently by neural nets of the same structure (e.g., Model 4 and Model 5).]


From time to time, we are advised to set seeds for random numbers before
training an ML model. Here's why.


The weight initialization of a model is done randomly. Thus, a
repeated experiment will generally not generate the same set of numbers.
This can hinder the reproducibility of your model.


As shown above, the same input data gets transformed in many ways 
by different neural networks of the same structure. 


Thus, before training any model, always ensure that you set seeds so 
that your experiment is reproducible later. 
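
The effect of a seed can be seen with NumPy alone. In this sketch, init_weights is a hypothetical stand-in for a model's weight initialization; deep learning frameworks expose their own seeding functions, which would be needed in addition.

```python
import numpy as np


def init_weights(seed=None, shape=(4, 3)):
    """Draw a small random weight matrix; `seed=None` means unseeded."""
    rng = np.random.default_rng(seed)
    return rng.normal(0, 0.1, size=shape)


seeded_a = init_weights(seed=42)
seeded_b = init_weights(seed=42)
unseeded_a = init_weights()
unseeded_b = init_weights()

print(np.array_equal(seeded_a, seeded_b))      # identical weights
print(np.array_equal(unseeded_a, unseeded_b))  # different weights
```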
An Underrated Technique To Improve
Your Data Visualizations


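[Figure: annotations.py. Step 1 creates the plot. Annotations are then added with ax.annotate, for example the text 'First Wave' with an arrow location of ('2020-02-20', 3400) and a text location starting at '2020-01-31' (the rest is cut off in the source). The annotated plot labels the First, Second, Third, Fourth, and Fifth Waves.]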


At times, ensuring that your plot conveys the right message may
require you to provide additional context. Yet, adding extra plots
may clutter your whole visualization.


One great way to provide extra info is by adding text annotations to a 
plot. 


In matplotlib, you can use annotate(). It adds explanatory texts to 
your plot, which lets you guide a viewer's attention to specific areas 
and aid their understanding. 
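
A minimal sketch of annotate() is shown below, using made-up data rather than the wave data from the figure. Here xy is the point the arrow points at and xytext is where the label is placed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical series with one clear peak.
x = np.arange(10)
y = np.array([1, 3, 2, 8, 4, 3, 5, 2, 1, 0])

fig, ax = plt.subplots()
ax.plot(x, y)

peak = int(np.argmax(y))
ann = ax.annotate(
    "Peak",
    xy=(x[peak], y[peak]),               # arrow location
    xytext=(x[peak] + 2, y[peak] - 1),   # text location
    arrowprops=dict(arrowstyle="->"),
)
```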


Find more info here: Matplotlib docs. 
A No-Code Tool to Create Charts and
Pivot Tables in Jupyter


notebook.ipynb

from pivottablejs import pivot_ui
pivot_ui(df)

[Figure: the PivotTableJS interface in "Table" mode with a "Count" aggregator. Draggable fields include Company_Name, Employee_Job_Title, Employee_City, Employee_Country, Employee_Salary, Employment_Status, and Employee_Rating. The pivot table shows Employee_City as rows and Employment_Status (Full Time, Intern) as columns, with totals.]


Here's a quick and easy way to create pivot tables, charts, and group 
data without writing any code. 


PivotTableJS is a drag-n-drop tool for creating pivot tables and 
interactive charts in Jupyter. What's more, you can also augment pivot 
tables with heatmaps for enhanced analysis. 


Find more info here: PivotTableJS. 


Watch a video version of this post for enhanced 
understanding: Video. 
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If You Are Not Able To Code A 
Vectorized Approach, Try This. 


df.shape
(100000, 9)


1) iterrows()
%timeit [my_func(row) for index, row in df.iterrows()]
2.63 s ± 7.55 ms per loop (Slowest)

2) apply()
%timeit df.apply(my_func, axis=1)

3) itertuples()
%timeit [my_func(row) for row in df.itertuples()]
87.3 ms ± 486 µs per loop (Fast)

4) to_numpy()
%timeit np_arr = df.to_numpy(); [my_func(row) for row in np_arr]
32.9 ms ± 240 µs per loop (Fastest)


Although we should prefer vectorized code over iterating a dataframe,
what if we are not able to come up with a vectorized solution?


In yesterday's post on why iterating a dataframe is costly, someone
posed a pretty genuine question. They asked: “Let’s just say you are
forced to iterate. What will be the best way to do so?”


Firstly, understand that the primary reason behind the slowness of
iteration is the way a dataframe is stored in memory. (If you wish to
recap this, read yesterday’s post here.)


A dataframe is a column-major data structure, so retrieving its rows
requires accessing non-contiguous blocks of memory. This increases the
run-time drastically.
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Yet, if you wish to perform only row-based operations, a quick fix is to 
convert the dataframe to a NumPy array. 


NumPy is faster here because, by default, it stores data in a row-major
manner. Thus, its rows are retrieved by accessing contiguous blocks of
memory, making it more efficient than iterating a dataframe.


That being said, do note that the best way is always to write
vectorized code. Use the Pandas-to-NumPy approach only when you are
truly struggling with writing vectorized code.
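

The two faster options can be sketched as follows. The dataframe and my_func here are hypothetical stand-ins; the point is only that the same row-wise function works on both the tuples and the NumPy rows, so switching is cheap.

```python
import numpy as np
import pandas as pd

# Hypothetical frame and row-wise function, for illustration only.
df = pd.DataFrame(
    np.arange(12, dtype=float).reshape(4, 3), columns=["a", "b", "c"]
)

def my_func(row):
    # Works on anything indexable by position: a tuple or a NumPy row.
    return row[0] + 2 * row[1] - row[2]

# Option 1: itertuples (index=False so positions match the columns).
via_tuples = [my_func(row) for row in df.itertuples(index=False)]

# Option 2: convert once, then iterate over the rows of the array.
np_arr = df.to_numpy()
via_numpy = [my_func(row) for row in np_arr]

print(via_tuples == via_numpy)
```

Timings depend on the data and the function, so it is worth running %timeit on your own frame before settling on one approach.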
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Why Are We Typically Advised To 
Never Iterate Over A DataFrame? 


df.shape
(32768000, 9)


Access column
%timeit df["my_column"]
1.73 µs ± 546 ns per loop


Access row
%timeit df.iloc[0]
38.4 µs ± 1.47 µs per loop


From time to time, we are advised to avoid iterating on a Pandas 
DataFrame. But what is the exact reason behind this? Let me explain. 


A DataFrame is a column-major data structure. Thus, consecutive 
elements in a column are stored next to each other in memory. 


As processors are efficient with contiguous blocks of memory,
retrieving a column is much faster than a row.


But while iterating, as each row is retrieved by accessing
non-contiguous blocks of memory, the run-time increases drastically.


In the image above, retrieving over 32M elements of a column was still
over 20x faster than fetching just nine elements stored in a row.
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Manipulating Mutable Objects In 
Python Can Get Confusing At Times 


[Figure: Method1.py and Method2.py. Both define a list a and assign
b = a. Method1.py then uses "a = a + ...": printing a shows the modified
list, while b is unchanged. Method2.py uses "a += ...": printing a and b
shows that both are modified.]


Did you know that with mutable objects, “a +=” and “a = a +” work 
differently in Python? Here's why. 


Let's consider a list, for instance. 


With “a = a +”, the + operator creates a new list object in memory,
and the = operator binds the variable a to that new object.


Thus, all the other variables still reference the previous memory
location, which was never updated. This is shown
in Method1.py above.


But with the += operator, changes are enforced in-place. This means 
that Python does not create a new object and the same memory location 
is updated. 


Thus, changes are visible through all other variables that reference the 
same location. This is shown in Method2.py above. 


We can also verify this by comparing the id() pre-assignment and post- 
assignment. 


[Figure: id(a) compared before and after the operation in Method1.py
and Method2.py.]


With “a = a +”, the id gets changed, indicating that Python created a 
new object. However, with “a +=”, id stays the same. This indicates 
that the same memory location was updated. 
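

The id() comparison described above can be reproduced with a short sketch. The variable names c and d are added here only so both cases fit in one snippet.

```python
a = [1, 2, 3]
b = a                           # b refers to the same list object as a
id_before = id(a)

a = a + [4, 5]                  # builds a new list and rebinds the name a
rebound = id(a) != id_before    # True: a now points at a different object
b_after_plus = list(b)          # b still sees [1, 2, 3]

c = [1, 2, 3]
d = c
id_before = id(c)

c += [4, 5]                     # extends the existing list in place
same_object = id(c) == id_before  # True: no new object was created
d_after_iadd = list(d)          # d sees [1, 2, 3, 4, 5]

print(rebound, b_after_plus, same_object, d_after_iadd)
```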
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This Small Tweak Can Significantly 
Boost The Run-time of KMeans 


[Figure: KMeans clustering. Incorrect clusters with KMeans; correct
clusters with KMeans++.]


KMeans is a popular but high-run-time clustering algorithm. Here's 
how a small tweak can significantly improve its run time. 


KMeans selects the initial centroids randomly. As a result, it can
converge to a poor solution at times. This requires us to repeat
clustering several times with different initializations.


Instead, KMeans++ takes a smarter approach to initialize centroids.
The first centroid is selected randomly. But each subsequent centroid
is sampled with a probability proportional to its squared distance
from the nearest centroid chosen so far.


In other words, a point that is far away from the already chosen
centroids is more likely to be selected as an initial centroid. This
way, all the initial centroids are likely to lie in different clusters
already and the algorithm may converge faster.
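

To make the seeding step concrete, here is a minimal NumPy sketch of it. The function name kmeans_pp_init and the toy data are assumptions for illustration, not code from any library; in scikit-learn, the same idea is available through KMeans(init="k-means++").

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids from the rows of X, KMeans++ style."""
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid.
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Far-away points are proportionally more likely to be picked.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Hypothetical data: four well-separated 2D blobs.
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

init = kmeans_pp_init(X, k=4)
print(init.shape)
```

Only the initialization is shown; the usual KMeans assignment and update steps would run afterwards, starting from init.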
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The illustration below shows the centroid initialization of KMeans++: 


[Figure: KMeans++ initialization on a scatter plot, showing Initial
Centroid 1, Initial Centroid 2, Initial Centroid 3, and Initial
Centroid 4.]
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Most Python Programmers Don't 
Know This About Python OOP 


class Point2D:
    def __new__(cls, x, y):
        # Create the object only when the condition is True
        if isinstance(x, int) and isinstance(y, int):
            print("Creating Object!")
            return super().__new__(cls)  # Return new object
        else:
            raise TypeError("x and y must be integers")

    def __init__(self, x, y):
        self.x = x
        self.y = y
        print("Object Initialized!")


>>> p1 = Point2D(1, 2)
Creating Object!       # from __new__
Object Initialized!    # from __init__


>>> p2 = Point2D(1.5, 2.5)
TypeError: x and y must be integers


Most Python programmers misunderstand the __init__() method. They
think that it creates a new object. But that is not true.


When we create an object, it is not the __init__() method that
allocates memory to it. As the name suggests, __init__() only assigns
values to an object's attributes.


Instead, Python invokes the __new__() method first to create a new
object and allocate memory to it. But how is that useful, you may
wonder? There are many reasons.


For instance, by implementing the __new__() method, you can apply
data checks. This ensures that your program allocates memory only
when certain conditions are met.




Other common use cases involve defining singleton classes (classes 
with only one object), creating subclasses of immutable classes such as 
tuples, etc. 
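

The singleton use case can be sketched in a few lines. The class name Config is hypothetical, and this is one common way to write a singleton rather than the only one.

```python
class Config:
    """Hypothetical singleton: every call returns the same instance."""
    _instance = None

    def __new__(cls):
        # Allocate the object only the first time the class is called.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

first = Config()
second = Config()
print(first is second)  # True
```

Note that __init__() would still run on every call, so any one-time setup belongs inside the "is None" branch.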
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Who Said Matplotlib Cannot Create 
Interactive Plots? 


[Figure: an interactive Matplotlib plot whose "Evaluate" input reads
x + 20*np.sin(x).]


Please watch a video version of this post for better
understanding: Video Link.


In most cases, Matplotlib is used to create static plots. But very few 
know that it can create interactive plots too. Here's how. 


By default, Matplotlib in Jupyter uses the inline mode, which renders
static plots. However, with the %matplotlib widget magic command
(provided by the ipympl package), you can enable an interactive backend
for Matplotlib plots.


What's more, its widgets module offers many useful widgets. You can 
integrate them with your plots to make them more elegant. 


Find a detailed guide here: Matplotlib widgets. 
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Don't Create Messy Bar Plots. 
Instead, Try Bubble Charts! 


[Figure: bubble_chart.py. The same data (Country A to Country O, by
Year) drawn twice: as a cluttered bar plot with px.bar(df, ...) and as
a bubble chart with px.scatter(df, ...), with one row of bubbles per
country.]


Bar plots often get incomprehensible and messy when we have many 
categories to plot. 


Bubble charts can be a better choice in such cases. They are like
scatter plots but with one categorical and one continuous axis.


Compared to a bar plot, they are less cluttered and offer better 
comprehension. 


Of course, the choice of plot ultimately depends on the nature of the 
data and the specific insights you wish to convey. 


Which plot do you typically prefer in such situations? 
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You Can Add a List As a Dictionary's 
Key (Technically)! 


[Figure: with a plain (unhashable) list, my_dict[my_list] = True raises
"TypeError: unhashable type". With a hashable list subclass, the same
assignment succeeds and print(my_dict) shows {[1, 2, 3]: True}.]


Python raises an error whenever we add a list as a dictionary's key. But 
do you know the technical reason behind it? Here you go. 


Firstly, understand that everything in Python is an object instantiated 
from some class. Whenever we add an object as a dict's key, Python 
invokes the __hash__ function of that object's class. 


While classes of int, str, tuple, frozenset, etc. implement the __hash__
method, it is missing from the list class. That is why we cannot add a
list as a dictionary's key.


Thus, technically if we extend the list class and add this method, a list 
can be added as a dictionary's key. 


While this makes a list hashable, it isn't recommended as it can lead to 
unexpected behavior in your code. 
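

A small sketch of this trick, and of why it is risky, is shown below. The subclass name HashableList and the choice to hash the contents as a tuple are assumptions made for illustration.

```python
class HashableList(list):
    """Hypothetical list subclass that defines __hash__."""
    def __hash__(self):
        # Hash the current contents, the way a tuple would.
        return hash(tuple(self))

try:
    {[1, 2, 3]: True}             # a plain list is rejected as a key
    plain_list_error = None
except TypeError as exc:
    plain_list_error = str(exc)

my_list = HashableList([1, 2, 3])
my_dict = {my_list: True}         # accepted, because __hash__ exists

# The pitfall: the hash depends on the contents, so mutating the key
# changes its hash after it has been stored in the dictionary.
hash_before = hash(my_list)
my_list.append(4)
hash_after = hash(my_list)
print(hash_before != hash_after)
```

Since a dictionary locates keys by their hash, a key whose hash changes after insertion may no longer be found, which is the unexpected behavior mentioned above.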
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Most ML Folks Often Neglect This 
While Using Linear Regression 


[Figure: two regression fits. Heteroscedasticity: non-constant residual
variance. Homoscedasticity: constant residual variance.]


The effectiveness of a linear regression model is determined by how 
well the data conforms to the algorithm's underlying assumptions. 


One highly important, yet often neglected assumption of linear 
regression is homoscedasticity. 


A dataset is homoscedastic if the variability of residuals (actual
minus predicted) stays the same across the input range.


In contrast, a dataset is heteroscedastic if the residuals have
non-constant variance.


Homoscedasticity is extremely critical for linear regression. This is
because the standard errors of our regression coefficients, and hence
their confidence intervals, are only reliable when the residual
variance is constant. Moreover, we can trust that the uncertainty
around the predictions stays the same across the input range.
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35 Hidden Python Libraries That Are 
Absolute Gems 


[Figure: 35 Gem Python Libraries Waiting To Be Discovered.]

I reviewed 1,000+ Python libraries and discovered these hidden gems I 
never knew even existed. 


Here are some of them that will make you fall in love with Python and 
its versatility (even more). 

Read this full list here: 
https://avichawla.substack.com/p/35-gem-py-libs. 
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Use Box Plots With Caution! They 
May Be Misleading. 


[Figure: Identical Box Plots for Different Datasets. Dataset A,
Dataset B, and Dataset C produce the same box plot, while their violin
plots reveal different distributions.]


Box plots are quite common in data analysis. But they can be 
misleading at times. Here's why. 


A box plot is a graphical representation of just five numbers — min, 
first quartile, median, third quartile, and max. 


Thus, two different datasets with the same five values will produce
identical box plots. This, at times, can be misleading and one may draw
wrong conclusions.


The takeaway is NOT that box plots should not be used. Instead, look 
at the underlying distribution too. Here, histograms and violin plots 
can help. 


Lastly, always remember that when you condense a dataset, you don't 
see the whole picture. You are losing essential information. 
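

Here is a tiny sketch of the problem with two made-up datasets. One is evenly spread and the other is clumped, yet their five-number summaries match, so their box plots would look the same.

```python
import numpy as np

a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])   # evenly spread
b = np.array([0, 2, 2, 2, 4, 6, 6, 6, 8])   # clumped around 2 and 6

def five_numbers(x):
    # min, first quartile, median, third quartile, max
    return np.percentile(x, [0, 25, 50, 75, 100])

print(five_numbers(a))
print(five_numbers(b))
```

Both calls return the same five values, even though a histogram of each dataset would look quite different.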
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An Underrated Technique To Create 
Better Data Plots 


[Figure: inset.py. A plot spanning -100 to 100 on the x-axis, with an
inset axes (axin) that shows a zoomed-in region of interest and an
indicator box connecting the two.]


While creating visualizations, there are often certain parts that are 
particularly important. Yet, they may not be immediately obvious to 
the viewer. 


A good data storyteller will always ensure that the plot guides the 
viewer's attention to these key areas. 


One great way is to zoom in on specific regions of interest in a plot. 
This ensures that our plot indeed communicates what we intend it to 
depict. 


In matplotlib, you can do so using indicate_inset_zoom(). It draws an
indicator box around the region of interest and connects it to an inset
axes that shows the zoomed-in view.
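

A minimal sketch of this workflow follows. The curve, the inset position, and the zoom limits are arbitrary choices for illustration; inset_axes() and indicate_inset_zoom() are the Matplotlib calls involved.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical curve with fine detail near x = 0.
x = np.linspace(-100, 100, 2000)
y = np.sin(x) * np.exp(-(x / 40) ** 2)

fig, ax = plt.subplots()
ax.plot(x, y)

# Inset axes in the upper-right corner (axes-fraction coordinates).
axin = ax.inset_axes([0.6, 0.6, 0.35, 0.35])
axin.plot(x, y)
axin.set_xlim(-5, 5)   # region of interest
axin.set_ylim(-1, 1)

# Draw the indicator box on the parent axes and link it to the inset.
ax.indicate_inset_zoom(axin)
```

The limits set on axin decide which region the indicator box marks on the main plot.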


Find more info here: Matplotlib docs. 
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The Pandas DataFrame Extension 
Every Data Scientist Has Been 
Waiting For 


[Figure: PyGWalker (Kanaries/pygwalker): turn your pandas dataframe
into a Tableau-style user interface for visual analysis.]


Watch a video version of this post for better
understanding: Video Link.


PyGWalker is an open-source alternative to Tableau that transforms a
pandas dataframe into a Tableau-style user interface for data
exploration.


It provides a Tableau-like UI in Jupyter, allowing you to analyze data
faster and without code.


Find more info here: PyGWalker. 
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Supercharge Shell With Python Using 
Xonsh 


[Figure: in Xonsh, shell commands (cat file.txt, cd Desktop) and Python
(import pandas as pd, a loop over range(5)) are used at the same
prompt, and the two can be mixed.]


Traditional shells have a limitation for Python users. At any given
time, users can either run shell commands or use IPython.


As a result, one has to open multiple terminals or switch back and forth
between them in the same terminal.


Instead, try Xonsh. It combines the convenience of a traditional shell 
with the power of Python. Thus, you can use Python syntax as well as 
run shell commands in the same shell. 


Find more info here: Xonsh. 
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Most Command-line Users Don't 
Know This Cool Trick About Using 
Terminals 


Without &:
bash-3.2$ python test_script.py
(the terminal is blocked until the script finishes)


With &:
bash-3.2$ python test_script.py &
[1] 94779
bash-3.2$
(the terminal is free while the script runs in the background)


Watch a video version of this post for better 
understanding: Video Link. 


After running a command (or script, etc.), most command-line users 
open a new terminal to run other commands. But that is never required. 


Here's how. 


When we run a program from the command line, by default, it runs in
the foreground. This means you can't use the terminal until the
program has finished.


However, if you add '&' at the end of the command, the program will 
run in the background and instantly free the terminal. 


This way, you can use the same terminal to run another command. 


To bring the program back to the foreground, use the 'fg' command. 
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A Simple Trick to Make The Most Out 
of Pivot Tables in Pandas 


[Figure: a pivot table of Company (Amazon, Apple, Google, IBM,
Microsoft, Uber) by Location (Bombay, Dubai, London, Moscow, Munich,
New York, Sydney, Tokyo), shown first as raw figures and then as a
seaborn heatmap built with table = pd.pivot_table(df, ...) and
sns.heatmap(table, ...).]


Pivot tables are pretty common for data exploration. Yet, analyzing raw 
figures is tedious and challenging. What's more, one may miss out on 
some crucial insights about the data. 


Instead, enrich your pivot tables with heatmaps. The color encodings 
make it easier to analyze the data and determine patterns. 
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Why Python Does Not Offer True 
OOP Encapsulation 


class MyClass:
    def __init__(self):
        self.public_attr = "I'm public"          # 0 underscores
        self._protected_attr = "I'm protected"   # 1 underscore
        self.__private_attr = "I'm private"      # 2 underscores


my_obj = MyClass()


my_obj.public_attr
"I'm public"


my_obj._protected_attr
"I'm protected"


my_obj._MyClass__private_attr
"I'm private"


Using access modifiers (public, protected, and private) is fundamental 
to encapsulation in OOP. Yet, Python, in some way, fails to deliver true 
encapsulation. 


By definition, a public member is accessible everywhere. A private 
member can only be accessed inside the base class. A protected 
member is accessible inside the base class and child class(es). 


But, with Python, there are no such strict enforcements. 


Thus, protected members behave exactly like public members. What's 
more, private members can be accessed outside the class using name 
mangling. 


As a programmer, remember that encapsulation in Python mainly relies 
on conventions. Thus, it is the responsibility of the programmer to 
follow them. 
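

To see what name mangling does and does not prevent, consider this short sketch. It reuses the attribute names from the example above; the try/except is only there to capture the error.

```python
class MyClass:
    def __init__(self):
        self._protected_attr = "I'm protected"
        self.__private_attr = "I'm private"  # stored as _MyClass__private_attr

obj = MyClass()

protected = obj._protected_attr          # nothing stops outside access

try:
    obj.__private_attr                   # the unmangled name does not exist
    direct_access_error = None
except AttributeError as exc:
    direct_access_error = type(exc).__name__

private = obj._MyClass__private_attr     # the mangled name still works
print(direct_access_error, private)
```

So the double underscore only renames the attribute; it guards against accidental clashes rather than determined access.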
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Never Worry About Parsing Errors 
Again While Reading CSV with 
Pandas 


!cat file.csv

Name,Amount
Alice,$300
Bob,$1\,000        <- separator appears in value
Charlie,$200


pd.read_csv("file.csv")

## ParserError: Error tokenizing data. C error:
## Expected 2 fields in line 3, saw 3


import clevercsv
clevercsv.read_dataframe("file.csv")

      Name  Amount
0    Alice    $300
1      Bob  $1,000
2  Charlie    $200


Pandas isn't smart enough (yet) to read messy CSV files.


Its read_csv method assumes the data source to be in a standard tabular 
format. Thus, any irregularity in data raises parsing errors, which may 
require manual intervention. 


Instead, try CleverCSV. It detects the format of CSVs and makes it 
easier to load them, saving you tons of time. 


Find more info here: CleverCSV. 
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An Interesting and Lesser-Known 
Way To Create Plots Using Pandas 


from base64 import b64encode
from io import BytesIO

import matplotlib.pyplot as plt
from IPython.display import HTML


def create_hist(data):
    fig, ax = plt.subplots(figsize=(2, 0.5))
    ax.hist(data, bins=10)
    ax.axis('off')
    plt.close(fig)

    img = BytesIO()  # Create Bytes object
    fig.savefig(img, format='png')  # Save image to Bytes object

    encoded = b64encode(img.getvalue())  # Encode object as base64 byte string
    decoded = encoded.decode('utf-8')  # Decode to utf-8

    return f'<img src="data:image/png;base64,{decoded}">'  # Return HTML tag


# create_line is the line-plot counterpart of create_hist (not shown here)
df['Last 7 Days'] = df['Price History'].apply(create_line)
df['Trade Volume'] = df['Price History'].apply(create_hist)


HTML(df.to_html(escape=False))


[Output: a table listing Bitcoin, Ethereum, and Litecoin, with the
Price History values shown alongside the Last 7 Days and Trade Volume
columns rendered as inline plots.]


Whenever you print/display a DataFrame in Jupyter, it is rendered 
using HTML and CSS. This allows us to format the output just like any 
other web page. 


One interesting way is to embed inline plots which appear as a column 
of a dataframe. 


In the above snippet, we first create a plot as we usually do. Next, we 
return the <img> HTML tag with its source as the plot. Lastly, we 
render the dataframe as HTML. 


Find the code for this tip here: Notebook. 
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Most Python Programmers Don't 
Know This About Python For-loops 


[Figure: a for-loop over range(5) that prints num and modifies it
inside the loop body. Each iteration still prints the next value from
the iterable.]


Often when we use a for-loop in Python, we tend not to modify the 
loop variable inside the loop. 


The impulse typically comes from acquaintance with other 
programming languages like C++ and Java. 


But for-loops don't work that way in Python. Modifying the loop 
variable has no effect on the iteration. 


This is because, before every iteration, Python unpacks the next item 
provided by iterable (range(5)) and assigns it to the loop variable 
(num). 


Thus, any changes to the loop variable are replaced by the new value 
coming from the iterable. 
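

A three-line experiment confirms this. The list seen is only there to record what each iteration receives.

```python
seen = []
for num in range(5):
    seen.append(num)   # value handed over by range(5)
    num = 10           # rebinding num does not affect the next iteration

print(seen)  # [0, 1, 2, 3, 4]
```

If you do need to control the counter yourself, a while loop is the usual choice.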
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How To Enable Function Overloading 
In Python 


[Figure: the Python interpreter only considers the latest definition of
the add() function, so add(1, 2) raises "TypeError: add() missing 1
required positional argument: 'z'". With the dispatch decorator from
multipledispatch, add(x, y) is decorated with @dispatch(int, int), a
three-argument add(x, y, z) is defined alongside it, and add(1, 2)
works: the dispatch decorator enables function overloading.]


Python has no native support for function overloading. Yet, there's a 
quick solution to it. 


Function overloading (having multiple functions with the same name 
but different number/type of parameters) is one of the core ideas 
behind polymorphism in OOP. 


But if you define many functions with the same name, Python only
considers the latest definition. This restricts writing polymorphic code.
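

The "latest definition wins" behavior is easy to reproduce in plain Python. The sketch below uses no third-party library; it only shows the limitation itself.

```python
def add(x, y):
    return x + y

def add(x, y, z):      # rebinds the name add; the two-argument version is gone
    return x + y + z

three_args = add(1, 2, 3)

try:
    add(1, 2)
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```

The second def does not add an overload. It simply replaces the function object that the name add points to.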


Despite this limitation, the dispatch decorator allows you to leverage 
function overloading. 


Find more info here: Multipledispatch. 
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Generate Helpful Hints As You Write 
Your Pandas Code 


import dovpanda


iter_df = df.iterrows()

    df.iterrows is not recommended. Essentially it is very similar to
    iterating the rows of the frames in a loop. In the majority of
    cases, there are better alternatives that utilize pandas' vector
    operation
    Line 1: iter_df = df.iterrows()


df["new_col"] = df.apply(apply_func)

    df.apply is not recommended. Essentially it is very similar to
    iterating the rows of the frames in a loop. In the majority of
    cases, there are better alternatives that utilize pandas' vector
    operation
    Line 1: df["new_col"] = df.apply(apply_func)


merged_df = pd.concat((df, df))

    All dataframes have the same columns and same number of rows. Pay
    attention, your axis is 0 which concatenates vertically
    Line 1: merged_df = pd.concat((df, df))

    After concatenation you have duplicated indices - pay attention
    Line 1: merged_df = pd.concat((df, df))


When manipulating a dataframe, at times, one may be using 
unoptimized methods. What's more, errors introduced into the data can 
easily go unnoticed. 


To get hints and directions about your data/code, try Dovpanda. It 
works as a companion for your Pandas code. As a result, it gives 
suggestions/warnings about your data manipulation steps. 


P.S. When you import Dovpanda, you will likely get an error.
Ignore it and proceed with using Pandas. You will still receive
suggestions from Dovpanda.


Find more info here: Dovpanda.
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Speedup NumPy Methods 25x With 
Bottleneck 


import bottleneck as bn
import numpy as np

arr = np.random.random((1000, ...


numpy.py                          bottleneck.py

np.sum(arr)       870 µs          bn.nansum(arr)       33.9 µs (25x)
np.mean(arr)      477 µs          bn.nanmean(arr)      21 µs (22x)
np.std(arr)       687 µs          bn.nanstd(arr)       175 µs
np.median(arr)    1.58 ms         bn.nanmedian(arr)    0.43 ms
np.max(arr)       1.26 ms         bn.nanmax(arr)       0.46 (unit not legible)


NumPy's methods are already highly optimized for performance. Yet,
here's how you can further speed them up.


Bottleneck provides a suite of optimized implementations of NumPy 
methods. 


Bottleneck is especially efficient for arrays with NaN values where 
performance boost can reach up to 100-120x. 


Find more info here: Bottleneck. 
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Visualizing The Data Transformation 
of a Neural Network 




If you struggle to comprehend how a neural network learns complex 
non-linear data, I have created an animation that will surely help. 


Please find the video here: Neural Network Animation. 


For linearly inseparable data, the task boils down to projecting the data 
to a space where it becomes linearly separable. 


Now, you could do this manually by adding relevant features that
transform your data to a linearly separable form. Consider concentric
circles, for instance. Passing the squares of the (x, y) coordinates
as features will do this job.
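

The concentric-circles case can be checked with a few lines of NumPy. The radii, the sample size, and the threshold below are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=200)

# Hypothetical data: inner circle (radius 1, class 0) and
# outer circle (radius 3, class 1).
radius = np.where(np.arange(200) < 100, 1.0, 3.0)
x = radius * np.cos(angles)
y = radius * np.sin(angles)
labels = (radius > 2).astype(int)

# Hand-crafted feature: squared distance from the origin.
r2 = x ** 2 + y ** 2

# A single threshold on r2 now separates the two classes.
predicted = (r2 > 4.0).astype(int)
print((predicted == labels).all())
```

In the original (x, y) space no straight line separates the two circles, but along the single feature x² + y² a plain threshold is enough.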


But in most cases, the transformation is unknown or complex to figure
out. Thus, non-linear activation functions are considered the best bet,
and a neural network is allowed to figure out this "non-linear to linear
transformation" on its own.


As shown in the animation, if we tweak the neural network by adding a
2D layer right before the output, and visualize this transformation, we
see that the neural network has learned to linearly separate the data.
We make this layer 2D because it is easy to visualize.


This linearly separable data can be easily classified by the last layer.
To put it another way, the last layer is analogous to a logistic
regression model which is given a linearly separable input.


The code for this visualization experiment is available here: GitHub. 
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Never Refactor Your Code Manually 
Again. Instead, Use Sourcery! 


my_code.py (before)


def is_special_number(number):
    if number == 7:
        return True
    elif number == 18:
        return True
    else:
        return False


Command Line


$ sourcery review --in-place my_code.py


my_code.py (after)


def is_special_number(number):
    return number in [7, 18]


Refactoring code is an important step in pipeline development. Yet, 
manual refactoring takes additional time for testing as one might 
unknowingly introduce errors. 


Instead, use Sourcery. It's an automated refactoring tool that makes 
your code elegant, concise, and Pythonic in no time. 


With Sourcery, you can refactor code from the command line, as an 
IDE plugin in VS Code and PyCharm, pre-commit, etc. 


Find more info here: Sourcery. 
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Draw The Data You Are Looking For 
In Seconds 


In [2]: from drawdata import draw_scatter 
draw_scatter() 


[Output: an interactive drawing canvas with reset, copy json, copy csv,
download json, and download csv buttons, and a selector for the classes
A, B, C, and D.]


Please watch a video version of this post for better 
understanding: Video Link. 


Often when you want data of some specific shape, programmatically
generating it can be a tedious and time-consuming task.


Instead, use drawdata. This allows you to draw any 2D dataset in a
notebook and export it. Besides a scatter plot, it can also create
histograms and line plots.


Find more info here: Drawdata. 
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Style Matplotlib Plots To Make Them 
More Attractive 


[Figure: select a style from all_styles = plt.style.available and draw
inside "with plt.style.context(style):". The same scatter plot is shown
in the default, Solarize_Light2, _mpl-gallery, classic,
fivethirtyeight, ggplot, seaborn, nature, and seaborn-whitegrid
styles.]


Matplotlib offers close to 50 different styles to customize the plot's 
appearance. 


To alter the plot's style, select a style from plt.style.available and 
create the plot as you originally would. 


Find more info about styling here: DOCS. 
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Speed-up Parquet I/O of Pandas by 5x 


pandas.py (file.parquet: 32M rows)


import pandas as pd

df = pd.read_parquet("file.parquet")
## Run-time: 41s


fastparquet.py (5x faster)


from fastparquet import ParquetFile

pf = ParquetFile('file.parquet')
df = pf.to_pandas()
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Dataframes are often stored in parquet files and read using Pandas' 
read_parquet() method. 


Rather than using Pandas, which relies on a single core, use
fastparquet. It offers immense speedups for I/O on parquet files using
parallel processing.


Find more info here: DOCS. 
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40 Open-Source Tools to Supercharge 
Your Pandas Workflow 


Pandas receives over 3M downloads per day. But 99% of its users are
not using it to its full potential.


I discovered these open-source gems that will immensely supercharge
your Pandas workflow the moment you start using them.


Read this list here:
https://avichawla.substack.com/p/37-open-source-tools-to-supercharge-pandas.


Stop Using The Describe Method in 
Pandas. Instead, use Skimpy. 


from skimpy import skim
skim(df)


[Output: the skimpy summary. It opens with a Data Summary (number of
rows and columns), Data Types (float64, category, datetime64, int64,
bool), and Categories (class, location), followed by separate tables
for the number, category, datetime, string, and bool columns.]
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Supercharge the describe method in Pandas. 


Skimpy is a lightweight tool for summarizing Pandas dataframes. In a 
single line of code, it generates a richer statistical summary than the 
describe() method. 


What's more, the summary is grouped by datatypes for efficient 
analysis. You can use Skimpy from the command line too. 


Find more info here: DOCS. 
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The Right Way to Roll Out Library 
Updates in Python 


my_library.py


from deprecated import deprecated   # add the deprecated decorator

@deprecated("old_function will be deprecated in the next release. Use new_function.")
def old_function():
    ...


project.py


old_value = old_function()   # prints a warning
DeprecationWarning: Call to deprecated function
old_function. (old_function will be deprecated
in the next release. Use new_function.)


While developing a library, authors may decide to remove some
functions/methods/classes. But rolling out the update instantly,
without any prior warning, isn't a good practice.


This is because many users may still be using the old methods and they 
may need time to update their code. 


Using the deprecated decorator, one can convey a warning to the 
users about the update. This allows them to update their code before it 
becomes outdated. 
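

To show what such a decorator does under the hood, here is a hand-rolled sketch built on the standard warnings module. It is not the library's implementation; the decorator below and the body of new_function are assumptions for illustration.

```python
import functools
import warnings

def deprecated(reason):
    """Hand-rolled stand-in for a deprecation decorator (illustrative)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"Call to deprecated function {func.__name__}. ({reason})",
                category=DeprecationWarning,
                stacklevel=2,  # point the warning at the caller's line
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator

def new_function():
    return 42

@deprecated("old_function will be deprecated in the next release. Use new_function.")
def old_function():
    return new_function()

# Record warnings so the sketch can inspect what was emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_value = old_function()

print(old_value, caught[0].category.__name__)
```

The old function keeps working, so users get the warning without their code breaking.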


Find more info here: GitHub. 
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Simple One-Liners to Preview a 
Decision Tree Using Sklearn 


from sklearn.tree import DecisionTreeClassifier

my_tree = DecisionTreeClassifier()
my_tree.fit(X, y)


from sklearn.tree import plot_tree, export_text


Method 1

plot_tree(my_tree, feature_names=features,
          class_names=classes, filled=True)

[Figure: the plotted tree. Root node: petal_width <= 0.8, gini = 0.667,
samples = 150, value = [50, 50, 50], class = setosa. Child node:
petal_width <= 1.75, gini = 0.5, samples = 100, value = [0, 50, 50],
class = versicolor.]


Method 2

print(export_text(my_tree, feature_names=features))

|--- petal_width <= 0.80
|   |--- class: setosa
|--- petal_width >  0.80
|   |--- petal_width <= 1.75
|   |   |--- class: versicolor
|   |--- petal_width >  1.75
|   |   |--- class: virginica
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If you want to preview a decision tree, sklearn provides two simple 
methods to do so. 


1. plot_tree creates a graphical representation of a decision tree.


2. export_text builds a text report showing the rules of a decision
tree.


These are typically used to understand the rules learned by a decision
tree and to gain a better understanding of the behavior of the
model.
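

For a runnable version of the text report, the sketch below trains a shallow tree on the Iris dataset that ships with scikit-learn. The max_depth and random_state values are arbitrary choices that keep the printed rules short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
features = list(iris.feature_names)

# A shallow tree keeps the printed rules short.
my_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
my_tree.fit(iris.data, iris.target)

report = export_text(my_tree, feature_names=features)
print(report)
```

The same fitted my_tree can be passed to plot_tree() for the graphical version.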
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Stop Using The Describe Method in 
Pandas. Instead, use Summarytools. 


from summarytools import dfSummary
dfSummary(iris_df)


dfSummary
Dimensions: 150 x 5
Duplicates: 1

No  Variable                 Stats / Values                     Freqs / (% of Valid)

1   sepal_length [float64]   Mean (sd) : 5.8 (0.8)              35 distinct values
                             min < med < max: 4.3 < 5.8 < 7.9
                             IQR (CV) : 1.3 (7.1)

2   sepal_width [float64]    Mean (sd) : 3.1 (0.4)              23 distinct values
                             min < med < max: 2.0 < 3.0 < 4.4
                             IQR (CV) : 0.5 (7.0)

3   petal_length [float64]   Mean (sd) : 3.8 (1.8)              43 distinct values
                             min < med < max: 1.0 < 4.3 < 6.9
                             IQR (CV) : 3.5 (2.1)

4   petal_width [float64]    Mean (sd) : 1.2 (0.8)              22 distinct values
                             min < med < max: 0.1 < 1.3 < 2.5
                             IQR (CV) : 1.5 (1.6)

5   species [object]         1. setosa                          50 (33.3%)
                             2. versicolor                      50 (33.3%)
                             3. virginica                       50 (33.3%)

[The Graph column (distribution charts) and the missing-value column,
whose legible entries read 0 (0.0%), are not reproduced here.]

Summarytools is a simple EDA tool that gives a richer summary than 
the describe() method. In a single line of code, it generates a standardized 
and comprehensive data summary. 


The summary includes column statistics, frequency, distribution chart, 
and missing stats. 


Find more info here: Summary Tools. 


Never Search Jupyter Notebooks 
Manually Again To Find Your Code 


Search in all notebooks (Terminal): 


$ nbgrep "import os" 


Benchmark.ipynb : cell 3:line 1 : import os 
modin.ipynb : [cell and line not legible] : import os 
kmeans.ipynb : [cell and line not legible] : import os 


Have you ever struggled to recall the specific Jupyter notebook in 
which you wrote some code? Here's a quick trick to save plenty of 
manual work and time. 


nbcommands provides a set of commands to interact with Jupyter 
notebooks from the terminal. 


For instance, you can search for code, preview a few cells, merge 
notebooks, and more. 
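
To make the search idea concrete, here is a small standard-library sketch of what such a search involves: a notebook is a JSON file, so its code cells can be scanned line by line. This is not how nbcommands is implemented, and the function name grep_notebooks is hypothetical.

```python
import json
from pathlib import Path


def grep_notebooks(pattern, folder="."):
    """Yield (notebook name, cell number, line number, line) for each match."""
    for path in sorted(Path(folder).glob("*.ipynb")):
        notebook = json.loads(path.read_text(encoding="utf-8"))
        for cell_no, cell in enumerate(notebook.get("cells", []), start=1):
            if cell.get("cell_type") != "code":
                continue  # only search code cells
            source = cell.get("source", [])
            # The notebook format stores source as a list of lines or one string.
            text = "".join(source) if isinstance(source, list) else source
            for line_no, line in enumerate(text.splitlines(), start=1):
                if pattern in line:
                    yield path.name, cell_no, line_no, line.strip()
```

Iterating over grep_notebooks("import os") in a folder of notebooks gives one tuple per matching line, which can be printed in the same "file : cell : line" layout as the terminal output above.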


Find more info here: GitHub. 


F-strings Are Much More Versatile 
Than You Think 


number = 205   # value inferred from the outputs below 


Formatting 

print(f"3 decimals: {number:.3f}")     # 3 decimals: 205.000 
print(f"5 digits: {number:05}")        # 5 digits: 00205 
print(f"scientific: {number:e}")       # scientific: 2.050000e+02 


Converting 

print(f"binary: {number:b}")           # binary: 11001101 
print(f"hex: {number:#x}")             # hex: 0xcd 
print(f"octal: {number:o}")            # octal: 315 


Here are 6 lesser-known ways to format/convert a number using f- 
strings. What is your favorite f-string hack? 


Is This The Best Animated Guide To 
KMeans Ever? 


[Animation title card: KMeans Clustering Algorithm. By: Avi Chawla] 


Have you ever struggled to understand KMeans: how it works, how 
the data points are assigned to a centroid, or how the centroids 
move? 


If yes, let me help. 


I created a beautiful animation using Manim to help you build an 
intuitive understanding of the algorithm. 
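
For readers who prefer code to animation, the two steps in question (assigning points to centroids, then moving the centroids) can be sketched in plain Python. This is a minimal illustration of the standard KMeans iteration, not the code behind the animation; the function names are chosen here for the example.

```python
import math


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    moved = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            moved.append(old)  # a centroid with no points stays where it is
            continue
        moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return moved


def kmeans(points, centroids, max_iter=100):
    """Alternate the two steps until the centroids stop moving."""
    labels = []
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
```

For example, kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], [(0, 0), (10, 10)]) groups the first two points under one centroid and the last two under the other.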


Please find this video here: Video Link. 


An Effective Yet Underrated 
Technique To Improve Model 
Performance 


import imgaug.augmenters as iaa 


seq = iaa.Sequential([ 
    iaa.Fliplr(0.5),   # comment cut off in the source 
    iaa.Rotate(...),   # rotation range cut off in the source 
]) 


images_aug = seq(images=images) 


Robust ML models are driven by diverse training data. Here's a simple 
yet highly effective technique that can help you create a diverse dataset 
and increase model performance. 


One way to increase data diversity is using data augmentation. 


The idea is to create new samples by transforming the available 
samples. This can prevent overfitting, improve performance, and build 
robust models. 


For images, you can use imgaug. It provides a 
variety of augmentation techniques such as flipping, rotating, scaling, 
adding noise to images, and more. 
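
To show what "creating new samples by transforming the available samples" means in practice, here is a NumPy-only sketch. It deliberately does not use imgaug; the augment function, the noise level, and the random stand-in images are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def augment(image, rng):
    """Return a randomly transformed copy of an (H, W, C) uint8 image."""
    out = image.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)                          # horizontal flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))    # rotate by a multiple of 90 degrees
    noise = rng.normal(0, 5, size=out.shape)          # mild Gaussian pixel noise
    return np.clip(out.astype(np.float64) + noise, 0, 255).astype(np.uint8)


# Stand-in data: four random 32x32 RGB images.
images = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
images_aug = np.stack([augment(img, rng) for img in images])
```

Each call produces a slightly different version of the same image, so the model sees more variety than the original dataset contains.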


Find more info: Imgaug. 


Create Data Plots Right From The 
Terminal 


>>> import numpy as np 
>>> from bashplotlib.histogram import plot_hist 


>>> np_arr = np.random.normal(size=1000) 
>>> plot_hist(np_arr, bincount=50) 


[Terminal output: a text histogram of np_arr drawn with "o" characters, with the bin counts along the y-axis.] 


Visualizing data can get tough when you don't have access to a GUI. 
But here's what can help. 


Bashplotlib offers a quick and easy way to make basic plots right from 
the terminal. Since it is pure Python, you can quickly install it anywhere 
using pip and visualize your data. 


Find more info here: Bashplotlib. 


Make Your Matplotlib Plots More 
Professional 


scienceplots.py 


import matplotlib.pyplot as plt 
import scienceplots 
plt.style.use('science') 


[Figure: the same Current (µA) versus Voltage (mV) plot, drawn with the default matplotlib style and with the 'science' style.] 


The default matplotlib plots are pretty basic in style and thus may not 
always be the right choice. Here's how you can make them more appealing. 


To create professional-looking and attractive plots for presentations, 
reports, or scientific papers, try Science Plots. 


Adding just two lines of code completely transforms the plot's 
appearance. 


Find more info here: GitHub. 


37 Hidden Python Libraries That Are 
Absolute Gems 


[Image: 37 Gem Python Libraries Waiting To Be Discovered] 


I reviewed 1,000+ Python libraries and discovered these hidden gems I 
never knew existed. 


Here are some of them that will make you fall in love with Python and 
its versatility (even more). 


Read this list here: https://avichawla.substack.com/p/gem-libraries. 


Preview Your README File Locally 
In GitHub Style 


[Figure: a README.md file is rendered from the terminal and previewed in the browser in GitHub style. The example shown is the README of the Daily Dose of Data Science code repository.] 


Please watch a video version for better understanding: Video Link. 


Have you ever wanted to preview a README file before committing it 
to GitHub? Here's how to do it. 


Grip is a command-line tool that renders a README file locally, 
as it will appear on GitHub, so you can check it before pushing. 


What's more, edits to the README are reflected in the browser 
instantly, without any page refresh. 


Read more: Grip. 


Pandas and NumPy Return Different 
Values for Standard Deviation. Why? 


std-dev.py 


import numpy as np 
import pandas as pd 


X = np.arange(20) 
df = pd.DataFrame(X) 


print(f"NumPy : {np.std(X)}") 
print(f"Pandas: {df[0].std()}") 


Different output: 

NumPy : 5.766... 
Pandas: 5.916... 


Pandas assumes that the data is a sample drawn from a larger population, 
in which case dividing by n tends to underestimate the population's spread. 


Thus, to correct for this bias, it uses (n-1) as the dividing 
factor instead of n. In statistics, this is known as Bessel's 
correction. 


NumPy, however, does not apply this correction by default: np.std() 
divides by n (ddof=0). Passing ddof=1 reproduces the Pandas result. 
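
The difference comes down to the divisor, which can be checked directly. The sketch below uses only NumPy; its ddof argument ("delta degrees of freedom") switches between dividing by n and by (n-1), the latter being what the Pandas std() method does by default.

```python
import numpy as np

X = np.arange(20)
n = X.size

population_std = np.std(X)        # NumPy default: ddof=0, divide by n
sample_std = np.std(X, ddof=1)    # Bessel's correction: divide by (n - 1)

# The same numbers computed by hand from the definition.
squared_dev = ((X - X.mean()) ** 2).sum()
by_hand_population = np.sqrt(squared_dev / n)
by_hand_sample = np.sqrt(squared_dev / (n - 1))

print(f"ddof=0: {population_std:.4f}")   # 5.7663
print(f"ddof=1: {sample_std:.4f}")       # 5.9161
```

For large n the two values converge, since n and (n-1) differ less and less in relative terms.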


Find more info here: Bessel’s correction. 


Visualize Commit History of Git Repo 
With Beautiful Animations 


Terminal 


$ git-story 


[Animation frame: a chain of commits labelled with their short hashes] 


As the size of your project grows, it can get difficult to comprehend 
the Git tree. 


Git-story is a command line tool to create elegant animations for your 
git repository. 


It generates a video that depicts the commits, branches, merges, HEAD 
commit, and more. 


Please watch a video version of this post here: Video. 


Read more: Git-story. 


Perfplot: Measure, Visualize and 
Compare Run-time With Ease 


perf-plot.py 


import numpy as np 
import perfplot 


perfplot.show( 
    setup=np.random.rand,                  # create data 
    kernels=[                              # functions to compare 
        lambda a: np.c_[a, a], 
        lambda a: np.stack([a, a]).T, 
        lambda a: np.vstack([a, a]).T, 
        lambda a: np.column_stack([a, a]), 
    ], 
    n_range=...,                           # input size range (not legible in the source) 
) 


Here's an elegant way to measure the run-time of various Python 
functions. 


Perfplot is a tool designed for quick run-time comparisons of many 
functions/algorithms. 


It extends Python's timeit package and allows you to quickly visualize 
the run-time in a clear and informative way. 
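
To see the measurement idea without the plotting layer, here is a standard-library sketch that times several functions over a range of input sizes with timeit. It only mimics the concept; the compare helper and the two string-building functions are hypothetical examples, not part of Perfplot.

```python
import timeit


def concat_plus(n):
    """Build a string with repeated +=."""
    out = ""
    for i in range(n):
        out += str(i)
    return out


def concat_join(n):
    """Build the same string with str.join."""
    return "".join(str(i) for i in range(n))


def compare(kernels, n_range, repeat=3, number=20):
    """Return {function name: [best seconds per call for each input size]}."""
    results = {}
    for kernel in kernels:
        timings = []
        for n in n_range:
            timer = timeit.Timer(lambda: kernel(n))
            best = min(timer.repeat(repeat=repeat, number=number)) / number
            timings.append(best)
        results[kernel.__name__] = timings
    return results


n_range = [10, 100, 1000]
results = compare([concat_plus, concat_join], n_range)
for name, timings in results.items():
    print(name, [f"{t:.2e}" for t in timings])
```

Perfplot wraps this kind of loop and additionally draws the timings against the input size, which makes scaling behavior much easier to read than a list of numbers.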


Find more info: Perfplot. 


This GUI Tool Can Possibly Save You 
Hours Of Manual Work 


[Figure: the Visual Python panel in Jupyter, with menu groups for Logic, Data Analysis, Visualization, and Machine Learning (Data Sets, Data Split, Data Prep, AutoML, Regressor, Classifier, Clustering, Dimension, Fit/Predict, Model Info, Evaluation), next to the code it generates and the resulting scatter plot.] 


# Visual Python: Data Analysis > Import 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
%matplotlib inline 

# Visual Python: Data Analysis > File 
df = pd.read_csv('./dummy_data.csv') 

# Visual Python: Visualization > Plotly 
fig = px.scatter(df, x='Employee_Rating', y='Employee_Salary', color='Employment_Status') 
fig.show() 


Please watch a video version of this post for better 
understanding: Link. 


This is one of the coolest and most useful Jupyter notebook-based 
data science tools. 


Visual Python is a GUI-based Python code generator. Using this, you 
can avoid writing code for many repetitive tasks. This 
includes importing libraries, I/O, Pandas operations, plotting, etc. 


Moreover, with a couple of clicks, you can generate the 
code for many ML-based utilities. This covers sklearn models, 
evaluation metrics, data splitting functions, and more. 


Read more: Visual Python. 


How Would You Identify Fuzzy 
Duplicates In A Data With Million 
Records? 


Data with duplicates: 

First_Name  | Last_Name | Address                     | Phone 
Daniel      | Lopez     | 719 Greene St. East Rhonda  | 9371184929 
Daniel      | NaN       | 719 Green Street East Rhoda | 93711-84929 
Alan        | Martin    | 982 Carol Harbors Apart.    | 7481919235 
Alan Martin | NaN       | 982 Carol Aparments         | 748-191-9235 
Philip      | Owens     | 2578 Banks Ford             | 869-6922x9581 
Shannon     | White     | USCGC Molina                | (150)082-7982 
Julia       | Anderson  | 09162 Mason Mnts.           | 698-1590x3236 
Juliya      | Anderrson | 9162 Mason Street Mountain  | 69815903236 


Command Line 


$ csvdedupe input.csv \ 
--field_names First_Name Last_Name Address Phone \ 
--output_file output.csv 


[Output (output.csv): the same eight records with an added Cluster ID column that marks the likely duplicates.] 


Imagine you have over a million records with fuzzy duplicates. How 
would you identify potential duplicates? 


The naive approach of comparing every pair of records is infeasible in 
such cases. That's over 10^12 comparisons (n^2). Assuming a speed of 
10,000 comparisons per second, it will take roughly 3 years to 
complete. 


The csvdedupe tool solves this by cleverly 
reducing the comparisons. For instance, comparing the name "Daniel" 
to "Philip" or "Shannon" to "Julia" makes no sense. They are 
almost certainly distinct records. 


Thus, it groups the data into smaller buckets based on rules. One rule 
could be to group all records with the same first three letters in the 
name. 


This way, it drastically reduces the number of comparisons while 
retaining good accuracy. 
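
The bucketing idea (often called blocking) can be sketched with the standard library. This is a simplified illustration of the rule described above, not csvdedupe's actual algorithm; the similarity function based on difflib is an assumption made for the example. For scale, 10^12 comparisons at 10,000 per second is 10^8 seconds, or a little over 3 years.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

records = [
    ("Daniel", "Lopez", "719 Greene St. East Rhonda"),
    ("Daniel", "", "719 Green Street East Rhoda"),
    ("Philip", "Owens", "2578 Banks Ford"),
    ("Shannon", "White", "USCGC Molina"),
    ("Julia", "Anderson", "09162 Mason Mnts."),
    ("Juliya", "Anderrson", "9162 Mason Street Mountain"),
]

# Blocking rule: records share a bucket if the first three letters
# of the first name match.
blocks = defaultdict(list)
for idx, record in enumerate(records):
    blocks[record[0][:3].lower()].append(idx)


def similarity(a, b):
    """Character-level similarity of two records, between 0 and 1."""
    return SequenceMatcher(None, " ".join(a).lower(), " ".join(b).lower()).ratio()


# Only records inside the same bucket are compared with each other.
candidate_pairs = [pair for ids in blocks.values() for pair in combinations(ids, 2)]
all_pairs = len(records) * (len(records) - 1) // 2

print(f"pairs compared: {len(candidate_pairs)} instead of {all_pairs}")
for i, j in candidate_pairs:
    print(records[i][0], records[j][0], round(similarity(records[i], records[j]), 2))
```

On these six records, blocking cuts the work from 15 pairwise comparisons to 2. The trade-off is that a duplicate whose name differs within the first three letters would land in another bucket and be missed, which is why real tools combine several rules.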


Read more: CSVdedupe. 


Stop Previewing Raw DataFrames. 
Instead, Use DataTables. 


In [1]: import pandas as pd 
        from jupyter_datatables import init_datatables_mode 


In [2]: init_datatables_mode() 


In [3]: pd.read_csv("employee dataset.csv") 


[Figure: the DataFrame rendered as an interactive table with export buttons (including CSV), a "Show 10 entries" selector, and pagination. Footer: "Showing 1 to 10 of 69 entries (filtered from 1,000 total entries)". Sample size: 1,000 out of 1,000.] 


After loading any DataFrame in Jupyter, we preview it. But the preview 
hardly tells us anything about the data. 


One has to dig deeper by analyzing it, which involves simple yet 
repetitive code. 


Instead, use Jupyter-DataTables. 


It supercharges the default preview of a DataFrame with many common 
operations. This includes sorting, filtering, exporting, plotting column 
distribution, printing data types, and pagination. 


Please view a video version here for better understanding: Post Link. 
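

A Single Line That Will Make Your 
Python Code Faster 


def func_without_numba(): 
    result = [] 
    for a in range(...):            # loop bounds are not legible in the source 
        for b in range(...): 
            if (a + b) % 11 ...:    # rest of the condition is not legible in the source 
                result.append((a, b)) 


func_without_numba() 


numba.py 


from numba import njit 

@njit                               # ~33x faster 
def func_with_numba(): 
    ...                             # same body as func_without_numba 


func_with_numba() 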


If you are frustrated with Python's run-time, here's how a single line 
can make your code much faster. 


Numba is a just-in-time (JIT) compiler for Python. This means that it 
takes your existing Python code and generates fast machine code (at 
run-time). 


Thus, after compilation, your code runs at native machine code speed. 
Numba works best on code that uses NumPy arrays and functions, and 
loops. 


Get Started: Numba Guide. 


Prettify Word Clouds In Python 


from wordcloud import WordCloud 

wc = WordCloud().generate(text) 

[Figure: default rectangular word cloud] 


import numpy as np 
from PIL import Image 

mask = np.array(Image.open("pylogo.png"))   # WordCloud expects the mask as an array 
wc = WordCloud(mask=mask, ...)              # remaining arguments are not legible in the source 
wc.generate(text) 

[Figure: word cloud in the shape of the mask image] 


If you use word clouds often, here's a quick way to make them prettier. 


In Python, you can easily alter the shape and color of a word cloud. By 
supplying a mask image, the resultant word cloud will take its shape 
and appear fancier. 


Find more info here: Notebook Link. 


How to Encode Categorical Features 
With Many Categories? 


import category_encoders as ce 

enc = ce.BinaryEncoder(cols=['class']) 
enc.fit_transform(data["class"]) 


[Figure: a table with gender and class columns, next to the binary-encoded output columns (class_0, class_1, ...).] 


We often encode categorical columns with one-hot encoding. But the 
feature matrix becomes sparse and unmanageable with many 
categories. 


The category-encoders library provides a suite of encoders specifically 
for categorical variables. This makes it effortless to experiment with 
various encoding techniques. 


For instance, I used its binary encoder above to represent a categorical 
column in binary format. 
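
To illustrate why binary encoding needs fewer columns than one-hot encoding, here is a simplified standard-library sketch of the idea: give each category an integer code, then spread the code's binary digits over columns. The function binary_encode is hypothetical and is not the library's implementation, so its column layout may differ from what BinaryEncoder returns.

```python
def binary_encode(values):
    """Map each category to an integer code (from 1), then to binary digits."""
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes) + 1)   # codes follow order of first appearance
    width = max(codes.values()).bit_length()      # columns needed for the largest code
    return [[int(bit) for bit in format(codes[v], f"0{width}b")] for v in values]


classes = ["A", "B", "C", "A", "D", "E"]
encoded = binary_encode(classes)
for label, row in zip(classes, encoded):
    print(label, row)
```

Five distinct categories fit into 3 binary columns here, whereas one-hot encoding would need 5. The gap widens quickly: 1,000 categories need only 10 binary columns.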


Read more: Documentation. 


Calendar Map As A Richer 
Alternative to Line Plot 


from plotly_calplot import calplot 

calplot(chat_df, 
        x="date", 
        y="message_count") 


[Input: chat_df has one row per day (2022-01-01, 2022-01-02, ...) with the columns date and message_count.] 

[Figure: calendar heat map of work group messages, with one row per weekday (Mon to Sun) and one cell per day.] 


Ever seen one of those calendar heat maps? Here's how you can create 
one in two lines of Python code. 


A calendar map offers an elegant way to visualize daily data. At times, 
it depicts weekly/monthly seasonality in data better than a 
line plot does. For instance, imagine creating a line plot for "Work 
Group Messages" above. 


To create one, you can use "plotly_calplot". Its input should be a 
DataFrame in which each row holds the value for one date. 


Read more: Plotly Calplot. 


10 Automated EDA Tools That Will 
Save You Hours Of (Tedious) Work 


Most steps in a data analysis task stay the same across projects. Yet, 
manually digging into the data is tedious and time-consuming, which 
inhibits productivity. 


Here are 10 EDA tools that automate these repetitive steps and profile 
your data in seconds. 


Please find this full document in my LinkedIn 
post: Post Link. 


Why KMeans May Not Be The Apt 
Clustering Algorithm Always 


KMeans 

from sklearn import cluster 
cluster.KMeans(2).fit(X) 


DBSCAN 

from sklearn import cluster 
cluster.DBSCAN().fit(X) 


[Figure: two scatter plots of the same data, colored by predicted class (Class A, Class B). The KMeans plot is labelled "Incorrect clusters"; the DBSCAN plot shows its prediction for comparison.] 


KMeans is a popular clustering algorithm. Yet, its limitations make it 
unsuitable in many cases. 


For instance, KMeans assigns points purely based on their distance to 
the centroids. Thus, it can create wrong clusters when the clusters have 
arbitrary shapes. 


Among the many possible alternatives is DBSCAN, which is a density-based 
clustering algorithm. Thus, it can identify clusters of arbitrary 
shape and size. 


This makes it robust to non-spherical clusters and to outliers, which it 
labels as noise. Clusters of very different densities, however, remain 
difficult for a single eps setting. 
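
The contrast can be reproduced on a toy dataset. The sketch below uses sklearn's make_moons to generate two interleaved half-circles; the dataset, the eps value, and the use of the adjusted Rand index as a score are choices made for this illustration, not taken from the figure above.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clusters that are clearly not spherical.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means the clustering matches the true grouping.
kmeans_ari = adjusted_rand_score(y_true, kmeans_labels)
dbscan_ari = adjusted_rand_score(y_true, dbscan_labels)
print("KMeans ARI:", kmeans_ari)
print("DBSCAN ARI:", dbscan_ari)
```

On data like this, KMeans tends to cut straight across both half-circles, while DBSCAN follows each one along its dense band. Note that DBSCAN's result depends on eps, which usually has to be tuned to the scale of the data.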


Find more here: Sklearn Guide. 


Converting Python To LaTeX Has 
Possibly Never Been So Simple 


import latexify 
import math 


@latexify.function        # add decorator 
def roots(a, b, c): 
    return (-b + math.sqrt(b**2 - 4*a*c)) / (2*a) 


roots 


\mathrm{roots}(a, b, c) = \frac{-b + \sqrt{b^{2} - 4ac}}{2a} 


@latexify.function 
def fib(n): 
    if n < 2: 
        return 1 
    else: 
        return fib(n-1) + fib(n-2) 


fib 


\mathrm{fib}(n) = \begin{cases} 1, & \mathrm{if}\ n < 2 \\ \mathrm{fib}(n - 1) + \mathrm{fib}(n - 2), & \mathrm{otherwise} \end{cases} 


If you want to display Python code and its output as LaTeX, try 
latexify_py. With this, you can render a Python function as a LaTeX 
expression and make your code more interpretable. 


What's more, it can also generate the LaTeX source for Python code. This 
saves plenty of time and effort otherwise spent writing the expressions in 
LaTeX manually. 


Find more info here: Repository. 


Density Plot As A Richer Alternative 
to Scatter Plot 


scatter.py 


sns.scatterplot(x=df.x, 
                y=df.y) 


density.py 


sns.kdeplot(x=df.x, 
            y=df.y) 


Scatter plots are extremely useful for visualizing the relationship 
between two numerical variables. But when you have, say, thousands of 
data points, scatter plots can get too dense to interpret. 


A density plot can be a good choice in such cases. It depicts the 
distribution of points using colors (or contours). This makes it easy to 
identify regions of high and low density. 


Moreover, it can easily reveal clusters of data points that might not be 
obvious in a scatter plot. 


Read more: DOCS. 


30 Python Libraries to (Hugely) Boost 
Your Data Science Productivity 


Here's a collection of 30 essential open-source data science libraries. 
Each has its own use case and enormous potential to skyrocket your 
data science skills. 


I would love to know the ones you use. 


Please find this full document in my LinkedIn post: Post Link. 


Sklearn One-liner to Generate 
Synthetic Data 


dummy_data.py 


from sklearn.datasets import make_classification 


# create data 
X, y = make_classification(n_samples=50, 
                           n_features=4, 
                           n_classes=2) 


[Output: a preview of the generated arrays; the values are not legible in the source.] 


Often for testing/building a data pipeline, we may need some dummy 
data. 


With Sklearn, you can easily create a dummy dataset for regression, 
classification, and clustering tasks. 
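
As a quick sketch, the three kinds of dummy data mentioned above map onto three generator functions in sklearn.datasets. The sample sizes, noise level, and random_state values below are arbitrary choices for illustration.

```python
from sklearn.datasets import make_blobs, make_classification, make_regression

# Classification: 50 samples, 4 features, 2 classes
X_clf, y_clf = make_classification(n_samples=50, n_features=4, n_classes=2,
                                   random_state=0)

# Regression: 50 samples with a continuous target
X_reg, y_reg = make_regression(n_samples=50, n_features=4, noise=0.1,
                               random_state=0)

# Clustering: 50 points drawn around 3 centers
X_blob, y_blob = make_blobs(n_samples=50, n_features=2, centers=3,
                            random_state=0)

print(X_clf.shape, X_reg.shape, X_blob.shape)
```

Fixing random_state makes the generated data repeatable, which is convenient when the dummy data feeds a test.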


More info here: Sklearn Docs. 


Label Your Data With The Click Of A 
Button 


from ipyannotate import annotate 


from ipyannotate.buttons import ValueButton as Button 


annotation = annotate(images_data, # data 
buttons=[Button('Dog'), # label buttons list 
Button('Cat')]) 
annotation 


In [6]: labels = [task.value for task in annotation.tasks] # get labels 
labels 


Out[6]: ['Cat', 'Dog', 'Dog', 'Cat'] 


Often with unlabeled data, one may have to spend some time 
annotating/labeling it. 


To do this quickly in a Jupyter notebook, use ipyannotate. With this, 
you can annotate your data by simply clicking the corresponding 
button. 


Read more: ipyannotate. 
Watch a video version of this post on LinkedIn: Post Link. 


Analyze A Pandas DataFrame 
Without Code 


import pandas as pd 
from pandasgui import show 


df = pd.read_csv("Dummy Dataset.csv") 


show(df) 


[Figure: the PandasGUI window with DataFrame, Statistics, Grapher, and Reshaper tabs. The DataFrame tab shows the dataset as a table with the columns Name, Company_Name, Employee_Job_Title, Employee_City, Employee_Country, Employee_Salary, Employment_Status, Employee_Rating, and Credits.] 


If you want to analyze your DataFrame in a GUI-based application, try 
PandasGUI. It provides an elegant GUI for viewing, filtering, sorting, 
and describing tabular datasets. 


What's more, using its intuitive drag-and-drop functionality, you can 
easily create a variety of plots and export them as code. 


Watch a video version of this post on LinkedIn: Post Link. 


Python One-Liner To Create Sketchy 
Hand-drawn Plots 


Indent the plot under plt.xkcd(): 

with plt.xkcd(): 
    ...   # plotting call is not legible in the source 


[Figure: the resulting hand-drawn style plot, with a legend that includes Product B.] 


The xkcd comic is known for its informal and humorous style, as well as 
its stick figures and simple drawings. 


Creating such visually appealing hand-drawn plots is pretty simple 
using matplotlib. Just indent the code in a plt.xkcd() context to 
display them in comic style. 


Do note that this style is just used to improve the aesthetics of a plot 
through hand-drawn effects. However, it is not recommended for 
formal presentations, publications, etc. 


Read more: DOCS. 


70x Faster Pandas By Changing Just 
One Line of Code 


Pandas.py 


import pandas as pd 
data = "file.csv"  ## 2M rows 


df = pd.read_csv(data) 
## 3.6 sec 


pd.concat([df for _ in range(20)]) 
## 7.1 sec 


Modin.py 


import modin.pandas as pd 
data = "file.csv"  ## 2M rows 


df = pd.read_csv(data) 


pd.concat([df for _ in range(20)]) 


It is challenging to work on large datasets in Pandas. This, at times, 
requires plenty of optimization and can get tedious as the dataset 
grows further. 


Instead, try Modin. It delivers instant improvements with no extra 
effort. Change the import statement and use it like the Pandas API, 
with significant speedups. 


Read more: Modin Guide. 


An Interactive Guide To Master 
Pandas In One Go 


[Figure: Pandas mind map. Top-level branches include Arithmetic, I/O, Iterate, Create, Pivot, Append, Merge, Delete, Miscellaneous, Group, Change Method, Time Series, Convert, Display, Categorical, Plot, Sort, and DF Subset. The DataFrame Info branch is expanded as follows.] 

Shape          df.shape 
# Rows         len(df) 
Print N rows   df.head(N) (top N rows), df.tail(N) (bottom N rows) 
Info           df.describe(), df.mean()/df.median()/df.mode()/df.std() 
Data Types     df.dtypes 
Unique         df.col1.unique(), df.col1.nunique() 
Columns        df.columns 
Index          df.index 
Rename         df.rename({"idx1": "newidx1"}) 
               df.rename(lambda x: x+1) 
               df.rename({"col1": "newcol1"}, axis=1) 
               df.rename(lambda x: x+"_col", axis=1) 
               df.add_suffix('_col') 
               df.add_prefix('col_') 


Here's a mind map illustrating Pandas methods on one page. How 
many do you know? 


• Load/Save 
• DataFrame info 
• Filter 
• Merge 
• Time-series 
• Plot 
• and many more, in a single map. 


Find the full diagram here: Pandas Mind Map. 


Make Dot Notation More Powerful in 
Python 


myclass.py 


class Square: 
    def __init__(self, length): 
        self._side = length 

    @property                     # getter 
    def side(self): 
        return self._side 

    @side.setter                  # setter 
    def side(self, length): 
        if length < 0: 
            raise ValueError("Side cannot be negative") 
        self._side = length 


s = Square(10) 


s.side         # getter 


s.side = -2    # setter (with dot) 


ValueError: Side cannot be negative 


Dot notation offers a simple and elegant way to access and modify the 
attributes of an instance. 


Yet, it is a good programming practice to use the getter and setter 
method for such purposes. This is because it offers more control over 
how attributes are accessed/changed. 


To leverage both in Python, use the @property decorator. As a result, 
you can use the dot notation and still have explicit control over how 
attributes are accessed/set. 




The Coolest Jupyter Notebook Hack 


In [1]: import numpy as np 


In [2]: np.array([1, 2, 3]) 
Out[2]: array([1, 2, 3]) 


In [3]: _2 
Out[3]: array([1, 2, 3]) 


In [4]: Out[2] 
Out[4]: array([1, 2, 3]) 


In [5]: _oh[2] 
Out[5]: array([1, 2, 3]) 


Have you ever forgotten to assign the results to a variable in Jupyter? 
Rather than recomputing the result by rerunning the cell, here are three 
ways to retrieve the output. 


1) Use the underscore followed by the output-cell-index. 


2/3) Use the Out or _oh dict and specify the output-cell-index as the 
key. 


Create a Moving Bubbles Chart in 
Python 


[Figure: Jupyter notebook. The input DataFrame df has 9999 rows x 3 columns and holds the state of each sample_id at different timestamps (datetime).] 


from d3blocks import D3Blocks 
d3 = D3Blocks() 

d3.movingbubbles(df, 
                 datetime='datetime', 
                 sample_id='sample_id', 
                 state='state', 
                 filepath='./movingbubbles.html') 


Ever seen one of those moving bubbles charts? Here's how you can 
create one in Python in just three lines of code. 


A Moving Bubbles chart is an elegant way to depict the movements of 
entities across time. Using this, we can easily determine when clusters 
appear in our data and at what state(s). 


To create one, you can use "d3blocks". Its input should be a 
DataFrame. A row represents the state of a sample at a particular 
timestamp. 


Skorch: Use Scikit-learn API on 
PyTorch Models 


model.py 


# Define PyTorch model 
class MyModel(nn.Module): 
    def __init__(self): 
        ...   # define network 

    def forward(self, x): 
        ...   # forward pass 


# Use Scikit-learn API on model 
from skorch import NeuralNetClassifier 

model = NeuralNetClassifier( 
    MyModel, 
    ...,                      # argument not legible in the source 
    criterion=nn.MSELoss 
) 


model.fit(X, y) 
preds = model.predict(X) 


skorch is a high-level library for PyTorch that provides full Scikit-learn 
compatibility. In other words, it combines the power of PyTorch 
with the elegance of sklearn. 


Thus, you can train PyTorch models in a way similar to Scikit-learn, 
using functions such as fit, predict, score, etc. 


Using skorch, you can also put a PyTorch model in an sklearn 
pipeline, and more. 


Overall, it aims to be as flexible as PyTorch while offering an interface 
as clean as sklearn's. 


Read more: Documentation. 


Reduce Memory Usage Of A Pandas 
DataFrame By 90% 


## df.shape: (10**7, 2) 


>>> df.A.dtype 
dtype('int64') 
## Range: [-2^63, 2^63-1] 


>>> df.A.min(), df.A.max() 
(1, 100) 


>>> df.A.memory_usage() 
[value not legible in the source] 


>>> df["A"] = df.A.astype(np.int8) 
## Range: [-128, 127] 


>>> df.A.memory_usage() 


By default, Pandas assigns the widest datatype to numeric 
columns. For instance, an integer-valued column gets the int64 
datatype, irrespective of its range. 


To reduce memory usage, cast each column to the smallest datatype 
that still spans the range of values in that column. 
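
The saving is easy to verify on a plain NumPy array, which is what backs a numeric Pandas column. The sketch below mirrors the example above (values between 1 and 100) but uses a smaller array; the array size and the range check are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(1, 101, size=1_000_000)     # int64 values in [1, 100]

# Check that the target type can hold every value before casting.
info = np.iinfo(np.int8)                     # int8 range: [-128, 127]
fits = bool(info.min <= a.min() and a.max() <= info.max)

a_small = a.astype(np.int8)
saving = 1 - a_small.nbytes / a.nbytes

print(a.nbytes, a_small.nbytes, f"{saving:.1%}")   # 8000000 1000000 87.5%
```

Going from 8 bytes to 1 byte per value saves 87.5% for this column. Always check the value range first, because casting values outside the target range silently wraps around.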


Read this blog for more info. It details many techniques to optimize 
the memory usage of a Pandas DataFrame. 
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An Elegant Way To Perform 
Shutdown Tasks in Python 


my_file.py

import atexit

@atexit.register
def final_function():
    print("COMPLETED EXECUTION!")

for i in range(5):
    print("num", i)


Terminal

$ python my_file.py
num 0
num 1
num 2
num 3
num 4
COMPLETED EXECUTION!


Often towards the end of a program's execution, we run a few basic 
tasks such as saving objects, printing logs, etc. 


To invoke a method right before the interpreter is shutting down, 
decorate it with the @atexit.register decorator. 


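The good thing is that the handler also runs when the program ends
because of an uncaught exception or a call to sys.exit(). It does not
run if the process is killed by a signal that Python does not handle,
hits a fatal interpreter error, or calls os._exit(). Within those
limits, you can use this method to save the state of the program or
print any necessary details before it stops.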
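

To see the behaviour on a crash, here is a small sketch that runs a throwaway script in a child interpreter. The script body and the RuntimeError are made up for illustration; the point is only that the registered handler still prints its message even though the script dies with an uncaught exception.

```python
import subprocess
import sys
import textwrap

# A tiny script that registers a handler and then crashes with an exception.
script = textwrap.dedent("""
    import atexit

    @atexit.register
    def final_function():
        print("COMPLETED EXECUTION!")

    raise RuntimeError("something went wrong")
""")

# Run it in a separate interpreter so the crash does not affect this process.
result = subprocess.run([sys.executable, "-c", script],
                        capture_output=True, text=True)

print(result.stdout)       # output written by the atexit handler
print(result.returncode)   # non-zero, because of the uncaught exception
```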


Read more: Documentation. 
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Visualizing Google Search Trends of 
2022 using Python 


plot = sns.FacetGrid(df_trend, row="Term", ...)    ## define grid
plot.map(plt.plot, "Date", "Trend")                 ## map plot to grid


[Figure: "Search Trends: 2022". One line plot per search term (Ukraine, IPL, Twitter Takeover, Johnny Depp, Stable Diffusion, India Pakistan, Queen Elizabeth, Roger Federer, Rishi Sunak, Tech Layoffs) across the months JAN to DEC.]
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If your data has many groups, visualizing their distribution together 
can create cluttered plots. This makes it difficult to visualize the 
underlying patterns. 


Instead, consider plotting the distribution across individual groups 
using FacetGrid. This allows you to compare the distributions of 
multiple groups side by side and see how they vary. 


As shown above, a FacetGrid allows us to clearly see how different 
search terms trended across 2022. 


P.S. I used the year-in-search-trends repository to fetch the
trend data.
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Create A Racing Bar Chart In Python 


[7] df.head()

[Output: a DataFrame indexed by date (2020-04-03 to 2020-04-07) with one column of cumulative counts per country: France, Germany, Italy, Netherlands, USA, United Kingdom, ...]


[9] bcr.bar_chart_race(df=df, title='COVID-19 Deaths by Country')

[Figure: one frame of the animated racing bar chart "COVID-19 Deaths by Country", at 2020-04-06.]
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Ever seen one of those racing bar charts? Here's how you can create 
one in Python in just two lines of code. 


A racing bar chart is typically used to depict the progress of multiple 
values over time. 


To create one, you can use the "bar-chart-race" library. 


Its input should be a Pandas DataFrame where every row represents a
single timestamp. Each column holds the values of one category over
time.


Read more: Documentation. 
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Speed-up Pandas Apply 5x with 
NumPy 


Pandas Apply

def assign_class(num):
    if num < 10:
        return "Class A"
    elif num < 50:
        return "Class B"
    return "Class C"

df.A.apply(assign_class)


NumPy Select

condlist = [df.A < 10, df.A < 50]
resultlist = ["Class A", "Class B"]

np.select(condlist, resultlist, "Class C")    ## "Class C" is the default
## ~5x faster
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While creating conditional columns in Pandas, we tend to use the 
apply() method almost all the time. 


However, apply() in Pandas is nothing but a glorified for-loop. As a 
result, it misses the whole point of vectorization. 


Instead, you should use the np.select() method to create conditional 
columns. It does the same job but is extremely fast. 


The conditions and the corresponding results are passed as the first two 
arguments. The last argument is the default result. 
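

Here is a small runnable sketch of the same pattern. The five values in column A are made up for illustration. Note that the conditions are evaluated in order and the first matching one wins, which is why the second condition does not need a lower bound.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [3, 25, 70, 10, 49]})   # hypothetical data

# Conditions are checked in order; the first match wins.
condlist = [df["A"] < 10, df["A"] < 50]
choicelist = ["Class A", "Class B"]

# Rows matching no condition get the default value.
df["class"] = np.select(condlist, choicelist, default="Class C")
print(df["class"].tolist())
# ['Class A', 'Class B', 'Class C', 'Class B', 'Class B']
```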


Read more here: NumPy docs. 
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A No-Code Online Tool To Explore 
and Understand Neural Networks 


Visit: playground.tensorflow.org 


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Neural networks can be intimidating for beginners. Also, 
experimenting programmatically does not provide enough intuitive 
understanding about them. 


Instead, try TensorFlow Playground. Its elegant UI allows you to build, 
train and visualize neural networks without any code. 


With a few clicks, one can see how neural networks work and how 
different hyperparameters affect their performance. This makes it 
especially useful for beginners. 


Try here: Tensorflow Playground. 
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What Are Class Methods and When 
To Use Them? 


class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    @classmethod                          ## define classmethod
    def from_square(cls, size):
        return Rectangle(size, size)


rect = Rectangle.from_square(5)           ## create object using classmethod
print(rect.width)
print(rect.height)
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Class methods, as the name suggests, are bound to the class and not the 
instances of a class. They are especially useful for providing an 
alternative interface for creating instances. 


Moreover, they can be also used to define utility functions that are 
related to the class rather than its instances. 


For instance, one can define a class method that returns a list of all 
instances of the class. Another use could be to calculate a class-level 
statistic based on the instances. 


To define a class method in Python, use the @classmethod decorator. 
As a result, this method can be called directly using the name of the 
class. 
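

Both uses can be sketched in one class. The count attribute and the how_many() method below are hypothetical additions for illustration; the alternative constructor returns cls(...) rather than naming the class, so it keeps working for subclasses.

```python
class Rectangle:
    count = 0  # class-level statistic: number of rectangles created

    def __init__(self, width, height):
        self.width = width
        self.height = height
        Rectangle.count += 1

    @classmethod
    def from_square(cls, size):
        # Alternative constructor: cls is the class the method was called on.
        return cls(size, size)

    @classmethod
    def how_many(cls):
        # Utility tied to the class rather than to any single instance.
        return cls.count


rect = Rectangle.from_square(5)
print(rect.width, rect.height)   # 5 5
print(Rectangle.how_many())      # 1
```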


avichawla.substack.com


Make Sklearn KMeans 20x Faster


sklearn.py                        ## x_train shape: (500000, 1024)

from sklearn.cluster import KMeans

kmeans = KMeans(8).fit(x_train)


faiss.py                          ## ~20x faster

import faiss

kmeans = faiss.Kmeans(...)
kmeans.train(x_train)
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The KMeans algorithm is commonly used to cluster unlabeled data. But 
with large datasets, scikit-learn takes plenty of time to train and 
predict. 


To speed-up KMeans, use Faiss by Facebook AI Research. It provides 
faster nearest-neighbor search and clustering. 


Faiss uses "Inverted Index", an optimized data structure to store and 
index the data points. This makes performing clustering extremely 
efficient. 


Additionally, Faiss provides parallelization and GPU support, which 
further improves the performance of its clustering algorithms. 


Read more: GitHub. 
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Speed-up NumPy 20x with Numexpr 


import numpy as np
import numexpr as ne

a = np.random.random(10**7)
b = np.random.random(10**7)

%timeit np.cos(a) + np.sin(b)

%timeit ne.evaluate("cos(a) + sin(b)")
## 32.5 ms ± 229 µs per loop (~5x faster)
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Numpy already offers fast and optimized vectorized operations. Yet, it 
does not support parallelism. This provides further scope for improving 
the run-time of NumPy. 


To do so, use Numexpr. It allows you to speed up numerical 
computations with multi-threading and just-in-time compilation. 


Depending upon the complexity of the expression, the speed-ups can
range from 0.95x to 20x. Typically, it is expected to be 2x-5x.


Read more: Documentation. 
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A Lesser-Known Feature of Apply 
Method In Pandas 


## df has three columns: A, B, C

def min_max(row):
    return max(row), min(row)

>>> df.apply(min_max, axis=1,
             result_type="expand")


After applying a method on a DataFrame, we often return multiple 
values as a tuple. This requires additional steps to project it back as 
separate columns. 


Instead, with the result_type argument, you can control the shape and 
output type. As desired, the output can be either a DataFrame or a 
Series. 
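

The difference is easiest to see side by side. The two-row DataFrame below is a made-up example, and the column names "max" and "min" are assigned afterwards for readability (with result_type="expand" the new columns are simply numbered 0 and 1).

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 4], "B": [2, 6], "C": [3, 9]})   # hypothetical data

def min_max(row):
    return max(row), min(row)

# Default: one tuple per row, packed into a Series.
as_tuples = df.apply(min_max, axis=1)

# result_type="expand": each tuple element becomes its own column.
expanded = df.apply(min_max, axis=1, result_type="expand")
expanded.columns = ["max", "min"]
print(expanded)
```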
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An Elegant Way To Perform Matrix 
Multiplication 


import numpy as np


np.matmul(a, np.matmul(b, c)) 
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Matrix multiplication is a common operation in machine learning. Yet, 
chaining repeated multiplications using the matmul function makes the
code cluttered and unreadable. 


If you are using NumPy, you can instead use the @ operator to do the 
same. 
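

As a quick check that the two spellings agree, here is a sketch with three small random matrices (the shapes are arbitrary choices for illustration). The @ operator chains left to right, so a @ b @ c is evaluated as (a @ b) @ c, which matches the nested call up to floating-point rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((2, 3))
b = rng.random((3, 4))
c = rng.random((4, 5))

nested = np.matmul(a, np.matmul(b, c))
chained = a @ b @ c        # evaluated left to right: (a @ b) @ c

print(chained.shape)                  # (2, 5)
print(np.allclose(nested, chained))   # True
```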
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Create Pandas DataFrame from 
Dataclass 


from dataclasses import dataclass

@dataclass
class Point:
    x_loc: int
    y_loc: int

points = [Point(5, 5),
          Point(1, 4),
          Point(2, 3)]

pd.DataFrame(points)

   x_loc  y_loc
0      5      5
1      1      4
2      2      3


A Pandas DataFrame is often created from a Python list, dictionary, by 
reading files, etc. However, did you know you can also create a 
DataFrame from a Dataclass? 


The image demonstrates how you can create a DataFrame from a list of 
dataclass objects. 
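

A self-contained version of that example might look like this. The field names x_loc and y_loc are my reading of the screenshot, so treat them as assumptions. Pandas uses the dataclass field names as column names.

```python
from dataclasses import dataclass

import pandas as pd

@dataclass
class Point:
    x_loc: int
    y_loc: int

points = [Point(5, 5), Point(1, 4), Point(2, 3)]

# Field names become the column names; each object becomes one row.
df = pd.DataFrame(points)
print(df)
```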
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Hide Attributes While Printing A 
Dataclass Object 


from dataclasses import dataclass

@dataclass
class Student:
    name: str
    key: str

Jane = Student("Jane", "27HD")

print(Jane)                        ## prints all attributes
Student(name='Jane', key='27HD')


from dataclasses import dataclass, field

@dataclass
class Student:
    name: str
    key: str = field(repr=False)   ## attribute hidden in print

Jane = Student("Jane", "27HD")

print(Jane)
Student(name='Jane')
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By default, a dataclass prints all the attributes of an object declared 
during its initialization. 


But if you want to hide some specific attributes, declare repr=False 
in its field, as shown above. 
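

A minimal runnable version, using the same Student example. Note that repr=False only affects the printed representation; the attribute is still stored and accessible.

```python
from dataclasses import dataclass, field

@dataclass
class Student:
    name: str
    key: str = field(repr=False)   # still stored, just left out of the repr

jane = Student("Jane", "27HD")
print(jane)        # Student(name='Jane')
print(jane.key)    # 27HD
```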
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my_set 


my_dict = {my_set: "A set"}

TypeError: unhashable type: 'set'


frozenset.py                       ## use frozenset

my_set = frozenset(...)

my_dict = {my_set: "A frozen set"}

my_dict[my_set]
"A frozen set"


linkedin.com/in/avi-chawla


Dictionaries in Python require their keys to be hashable. As a result,
a set cannot be used as a key: it is mutable and therefore unhashable.


Yet, if you want to use a set, consider declaring it as a frozenset. 


It is an immutable set, meaning its elements cannot be changed after it 
is created. Therefore, they can be safely used as a dictionary’s key. 
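

Here is a short sketch with made-up elements. It first shows that a plain set is rejected as a key, then uses a frozenset built from the same elements. A lookup works with any frozenset that holds the same elements, regardless of the order they were given in.

```python
my_set = {1, 2, 3}
try:
    my_dict = {my_set: "A set"}
except TypeError as err:
    print("TypeError:", err)          # a plain set is not hashable

frozen = frozenset(my_set)
my_dict = {frozen: "A frozen set"}

# Lookup works with any frozenset holding the same elements.
print(my_dict[frozenset([3, 2, 1])])  # A frozen set
```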
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Difference Between Dot and Matmul 
in NumPy 


dot.py

>>> a: np.array       # Shape: (a,b,c,d)
>>> b: np.array       # Shape: (p,q,d,r)
>>> np.dot(a, b)      # Shape: (a,b,c,p,q,r)


matmul.py

>>> a: np.array       # Shape:
>>> b: np.array       # Shape:
>>> np.matmul(a, b)


The np.matmul() and np.dot() methods produce the same output for 
2D (and 1D) arrays. This makes many believe that they are the same 
and can be used interchangeably, but that is not true. 


The np.dot() method revolves around individual vectors (or 1D 
arrays). Thus, it computes the dot product of ALL vector pairs in the 
two inputs. 


The np.matmul() method, as the name suggests, is meant for 
matrices. Thus, it computes the matrix multiplication of corresponding 
matrices in the two inputs. 
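

The difference shows up in the output shapes once the inputs have more than two dimensions. The shapes below are arbitrary choices for illustration.

```python
import numpy as np

a = np.ones((2, 3, 4, 5))
b = np.ones((2, 3, 5, 6))

# dot: every vector along a's last axis is paired with every vector
# along b's second-to-last axis, so the leading axes of both are kept.
print(np.dot(a, b).shape)      # (2, 3, 4, 2, 3, 6)

# matmul: leading axes are batch dimensions; matrices are multiplied pairwise.
print(np.matmul(a, b).shape)   # (2, 3, 4, 6)
```

With np.matmul, the leading (batch) dimensions must match or be broadcastable, whereas np.dot places no such requirement on them.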
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Run SQL in Jupyter To Analyze A 
Pandas DataFrame 


Filter-Pandas.ipynb

df[df.city == "New Delhi"]


Filter-SQL.ipynb (DuckDB)

SELECT * FROM df
WHERE city = 'New Delhi';


Pandas already provides a wide range of functionalities to analyze 
tabular data. Yet, there might be situations when one feels more
comfortable using SQL than Python.


Using DuckDB, you can analyze a Pandas DataFrame with SQL syntax 
in Jupyter, without any significant run-time difference. 


Read the guide here to get started: DOCS. 
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Automated Code Refactoring With 
Sourcery 


my_code.py (before refactoring)

def set_my_var(condition):
    if condition:
        my_var = 1
    else:
        my_var = ...
    return my_var


Command Line

$ sourcery review --in-place my_code.py


my_code.py (after refactoring)

def set_my_var(condition):
    return 1 if condition else ...


Refactoring a codebase is an important yet time-consuming task.
Moreover, at times, one might unknowingly introduce errors during 
refactoring. 


This takes additional time for testing and gets tedious with more 
refactoring, especially when the codebase is big. 


Rather than following this approach, use Sourcery. It's an automated 
refactoring tool that makes your code elegant, concise, and Pythonic in 
no time. 


With Sourcery, you can refactor code in many ways. For instance, you 
can refactor scripts through the command line, as an IDE plugin in VS 
Code and PyCharm, etc. 


Read my full blog on Sourcery here: Medium. 
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__post_init__: Add Attributes To A
Dataclass Object Post Initialization


dataclass.py

from dataclasses import dataclass

@dataclass
class StudentMarks:
    student_id: str
    marks: float

    def __post_init__(self):
        if self.marks > 30:
            self.grade = "Pass"
        else:
            self.grade = "Fail"

Peter = StudentMarks("B20", 43)

print(Peter.grade)  # Pass


After initializing a class object, we often create derived attributes from 
existing variables. 


To do this in dataclasses, you can use the __post_init__ method. As 
the name suggests, this method is invoked right after the __init__ 
method. 


This is useful if you need to perform additional setups on your 
dataclass instance. 
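

A slightly more explicit variant is sketched below. Declaring grade with field(init=False) is my own design choice, not something the original example does: it keeps the derived attribute out of the __init__ arguments while still including it in the printed representation and in comparisons.

```python
from dataclasses import dataclass, field

@dataclass
class StudentMarks:
    student_id: str
    marks: float
    grade: str = field(init=False)   # derived, so not an __init__ argument

    def __post_init__(self):
        # Runs right after the generated __init__ has set the fields above.
        self.grade = "Pass" if self.marks > 30 else "Fail"

peter = StudentMarks("B20", 43)
print(peter)   # StudentMarks(student_id='B20', marks=43, grade='Pass')
```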
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Simplify Your Functions With Partial 
Functions 


def quadratic(x, a, b, c):
    return a*(x**2) + b*x + c


partial_function.py

from functools import partial

quadratic_c1 = partial(quadratic, c=1)

quadratic_c1(x=1, a=4, b=5)
# Output: 10


When your function takes many arguments, it can be a good idea to 
simplify it by using partial functions. 


They let you create a new version of the function with some of the 
arguments fixed to specific values. 


This can be useful for simplifying your code and making it more 
readable and concise. Moreover, it also helps you avoid repeating 
yourself while invoking functions. 
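

Here is the same idea as a runnable sketch. The extra calls are illustrative additions: they show that the remaining arguments can still be passed positionally, and that a frozen keyword argument can be overridden at call time.

```python
from functools import partial

def quadratic(x, a, b, c):
    return a * (x ** 2) + b * x + c

# Freeze c=1; the new callable only needs x, a and b.
quadratic_c1 = partial(quadratic, c=1)

print(quadratic_c1(x=1, a=4, b=5))   # 4*1 + 5*1 + 1 = 10
print(quadratic_c1(2, 1, 0))         # 1*4 + 0*2 + 1 = 5
print(quadratic_c1(1, 1, 1, c=7))    # keyword overrides the frozen value: 9
```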
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When You Should Not Use the head() 
Method In Pandas 


sort_values.py

df.sort_values(by="Marks",
               ascending=False).head(3)      ## ignores repeated values

Name    Marks
Jane    100
Mark    97
Peter   95


nlargest.py

df.nlargest(n=3,
            columns="Marks",
            keep="all")

Name    Marks
Jane    100
Mark    97
Peter   95
David   95


One often retrieves the top k rows of a sorted Pandas DataFrame by
using the head() method. However, there's a flaw in this approach.


If your data has repeated values, head() will not consider that and just 
return the first k rows. 


If you want to consider repeated values, use nlargest (or nsmallest) 
instead. Here, you can specify the desired behavior for duplicate values 
using the keep parameter. 
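

The sketch below uses a made-up marks table in which two students tie for third place. With keep="all", nlargest returns every row that ties with the last selected value, so the result can contain more than n rows.

```python
import pandas as pd

# Hypothetical data: Peter and David tie for third place.
df = pd.DataFrame({"Name": ["Jane", "Mark", "Peter", "David", "Amy"],
                   "Marks": [100, 97, 95, 95, 90]})

top_head = df.sort_values(by="Marks", ascending=False).head(3)
top_all = df.nlargest(n=3, columns="Marks", keep="all")

print(len(top_head))   # 3: one of the tied students is dropped
print(len(top_all))    # 4: both tied students are kept
```

The other options for keep are "first" (the default) and "last", which return exactly n rows and differ only in which of the tied rows they prefer.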
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DotMap: A Better Alternative to 
Python Dictionary 


dotmap.py

from dotmap import DotMap
students = DotMap()

students.john.id = ...
students.john.english = ...

students.mary.id = ...
students.mary.english = ...

students.dave.id = ...
students.dave.english = 34


dotmap-print.py                    ## pretty print

students.pprint()

[Output: a nested, dict-style listing of dave, john and mary, each with its id and english values.]
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Python dictionaries are great, but they have many limitations. 


It is difficult to create dynamic hierarchical data. Also, they don't offer 
the widely adopted dot notation to access values. 


Instead, use DotMap. It behaves like a Python dictionary but also 
addresses the above limitations. 


What's more, it also has a built-in pretty print method to display it as a 
dict/JSON for debugging large objects. 


Read more: GitHub. 
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Prevent Wild Imports With __all__ in
Python


my_functions.py

__all__ = ["func1", "func2"]

def func1():
    return "Function 1"

def func2():
    return "Function 2"

def func3():
    return "Function 3"


module.py

from my_functions import *         ## only func1 and func2 are imported

func1()
"Function 1"

func2()
"Function 2"

func3()
NameError: name 'func3' is not defined


Wild imports (from module import *) are considered a bad 
programming practice. Yet, here's how you can prevent it if someone 
irresponsibly does that while using your code. 


In your module, you can list the importable
functions/classes/variables in __all__. As a result, whenever someone
does a wild import, Python will only import the symbols specified
there.


This is also a useful way to convey which symbols in your module are
intended to be private.
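

To try this without creating files, the sketch below builds a throwaway module in memory as a stand-in for my_functions.py and runs the wild import into a separate namespace. The in-memory setup is only a convenience for the demo. It also shows that __all__ restricts wild imports only: the unlisted function is still reachable through an explicit import.

```python
import sys
import types

# Build a throwaway module in memory (a stand-in for my_functions.py).
source = '''
__all__ = ["func1", "func2"]

def func1(): return "Function 1"
def func2(): return "Function 2"
def func3(): return "Function 3"
'''
module = types.ModuleType("my_functions")
exec(source, module.__dict__)
sys.modules["my_functions"] = module

# Perform the wild import into an empty namespace and inspect the result.
namespace = {}
exec("from my_functions import *", namespace)
print("func1" in namespace, "func3" in namespace)   # True False

# __all__ only affects wild imports; explicit access still works.
import my_functions
print(my_functions.func3())                          # Function 3
```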
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Three Lesser-known Tips For Reading 
a CSV File Using Pandas 


pd.read_csv("data.csv",
            nrows=...)


pd.read_csv("data.csv",
            usecols=[...])


pd.read_csv("data.csv",
            skiprows=...)


pd.read_csv("data.csv",
            skiprows=...)


Here are three extremely useful yet lesser-known tips for reading a 
CSV file with Pandas: 


1. If you want to read only the first few rows of the file, specify the 
nrows parameter. 


2. To load a few specific columns, specify the usecols parameter. 


3. If you want to skip some rows while reading, pass the skiprows 
parameter. 
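

The three parameters can be tried on a small in-memory CSV. The contents below are made up, and io.StringIO stands in for a file on disk. For skiprows, a list skips those line numbers (with 0 being the header line), while an integer skips that many lines from the top.

```python
import io

import pandas as pd

csv_text = "a,b,c\n1,2,3\n4,5,6\n7,8,9\n10,11,12\n"   # hypothetical file contents

# 1. Read only the first two data rows.
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)

# 2. Load only columns "a" and "c".
only_ac = pd.read_csv(io.StringIO(csv_text), usecols=["a", "c"])

# 3. Skip line numbers 1 and 3 (line 0 is the header).
skipped = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 3])

print(first_two.shape)           # (2, 3)
print(list(only_ac.columns))     # ['a', 'c']
print(skipped["a"].tolist())     # [4, 10]
```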
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The Best File Format To Store A 
Pandas DataFrame 


CSV.py

df.to_csv("file.csv")


Parquet.py

df.to_parquet("file.parquet")


Feather.py

df.to_feather("file.feather")


Pickle.py

df.to_pickle("file.pickle")


In the image above, you can find the run-time comparison of storing a 
Pandas DataFrame in various file formats. 


Although CSV is a widely adopted format, it is the slowest format in
this list.


Thus, CSVs should be avoided unless you want to open the data 
outside Python (in Excel, for instance). 


Read more in my blog: Medium. 




Debugging Made Easy With 
PySnooper 


py-snooper.py

import pysnooper

@pysnooper.snoop()
def add_sub(a, b):
    add = a+b
    sub = a-b
    return (add, sub)

add_sub(9, 5)

[Output: a line-by-line trace of add_sub showing each executed line, the variables it creates (add, sub), and the return value.]


Rather than using many print statements to debug your python code, 
try PySnooper. 


With just a single line of code, you can easily track the variables at 
each step of your code's execution. 


Read more: Repository.
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Lesser-Known Feature of the Merge 
Method in Pandas 


pd.merge(name_df, rewards_df,
         on="Cust_ID",
         how="outer",
         indicator=True)


Cust_ID   Name    Rewards   _merge
...       Joe     NaN       left_only
...       Mark    50        both
...       Peter   20        both
...       NaN     70        right_only


While merging DataFrames in Pandas, keeping track of the source of 
each row in the output can be extremely useful. 


You can do this using the indicator argument of the merge() method.
It adds a column (named _merge by default) to the merged output,
which tells the source of each row.
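

A runnable sketch with made-up customer IDs follows. Besides showing the extra column, it filters on _merge to isolate the rows that appear in only one of the two DataFrames, which is a common reason to use the indicator in the first place.

```python
import pandas as pd

# Hypothetical inputs: customer 1 has no rewards, customer 4 has no name.
name_df = pd.DataFrame({"Cust_ID": [1, 2, 3], "Name": ["Joe", "Mark", "Peter"]})
rewards_df = pd.DataFrame({"Cust_ID": [2, 3, 4], "Rewards": [50, 20, 70]})

merged = pd.merge(name_df, rewards_df, on="Cust_ID", how="outer", indicator=True)
print(merged)

# The _merge column makes it easy to isolate unmatched rows.
unmatched = merged[merged["_merge"] != "both"]
print(sorted(unmatched["Cust_ID"].tolist()))   # [1, 4]
```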
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The Best Way to Use Apply() in 


Pandas 


## df shape: (... rows, 4 cols)

def sum_of_row(row):
    return sum(row)


Pandas-Apply.py

df.apply(sum_of_row,
         axis = 1)
## 44 sec


Pandarallel.py

df.parallel_apply(sum_of_row,
                  axis = 1)
## 12 sec (70% Faster)


Swifter.py

df.swifter.apply(sum_of_row,
                 axis = 1)
## 22 sec (50% Faster)


parallel-pandas.py

df.p_apply(sum_of_row,
           axis = 1)
## 12 sec (70% Faster)


Mapply.py

df.mapply(sum_of_row,
          axis = 1)


The image above shows a run-time comparison of popular open-source 
libraries that provide parallelization support for Pandas. 


You can find the links to these libraries here. Also, if you know any
other similar libraries built on top of Pandas, do post them in the
comments or reply to this email.
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Deep Learning Network Debugging 
Made Easy 


network.py

import tsensor

with tsensor.explain():
    for i in range(n_batches):
        batch = X[...]
        y = torch.matmul(w, batch.T) + b

[Figure: TensorSensor output for the statement y = torch.matmul(w, batch.T) + b, with each operand drawn as a box annotated with its dimensions and its dtype (<float32>).]


Aligning the shape of tensors (or vectors/matrices) in a network can be 
challenging at times. 


As the network grows, it is common to lose track of dimensionalities in 
a complex expression. 


Instead of explicitly printing tensor shapes to debug, use 
TensorSensor. It generates an elegant visualization for each 
statement executed within its block. This makes dimensionality 
tracking effortless and quick. 


In case of errors, it augments default error messages with more helpful 
details. This further speeds up the debugging process. 


Read more: Documentation 
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Lovely-NumPy Instead. 


array = np.random.rand(...)


array
[Output: the raw array values]
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Only 
numbers 


lovely_numpy.py

from lovely_numpy import lo

array = np.random.rand(...)
lo(array)
array[10, 20] n=200 x∈[0.0, 1.0] μ=0.51 σ=0.3


array = np.zeros(...)
lo(array)
array[10] all_zeros


array = ...
lo(array)
array[10, ...] x∈[0.0, 1.0] μ=0.51 ...


We often print raw NumPy arrays during debugging. But this approach
is not very useful. This is because printing does not convey much
information about the data it holds, especially when the array is large.


Instead, use lovely-numpy. Rather than viewing raw arrays, it prints
a summary of the array. This includes its shape, distribution, mean,
standard deviation, etc.


It also shows if the NumPy array has NaNs and Inf values, whether it is
filled with zeros, and more.


P.S. If you work with tensors, then you can use lovely-tensors. 


Read more: Documentation. 
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Performance Comparison of Python 
3.11 and Python 3.10 


fib.py

def fib(N):
    """
    Function to find the
    Nth Fibonacci number.

    fib(N) = fib(N-1) + fib(N-2)
    """


calc_pi.py

def pi_approx(n_terms):
    """
    Function to find the
    approximate value of pi.
    """


Call                 Python 3.10    Python 3.11
fib(30)              260 ms         ...
fib(40)              32 s           ...
pi_approx(10**6)     144 ms         65 ms (55% Faster)


Python 3.11 was released recently, and as per the official release, it is 
expected to be 10-60% faster than Python 3.10. 


I ran a few basic benchmarking experiments to verify the performance 
boost. Indeed, Python 3.11 is much faster. 


Although one might be tempted to upgrade asap, there are a few things 
you should know. Read more here. 
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View Documentation in Jupyter 
Notebook 


In [1]: import pandas as pd

In [2]: pd.DataFrame()  # Shift-Tab

Init signature:
pd.DataFrame(
    data=None,
    index: 'Axes | None' = None,
    columns: 'Axes | None' = None,
    dtype: 'Dtype | None' = None,
    copy: 'bool | None' = None,
)
Docstring:
Two-dimensional, size-mutable, potentially heterogeneous tabular data.


While working in Jupyter, it is common to forget the parameters of a 
function and visit the official docs (or Stackoverflow). However, you 
can view the documentation in the notebook itself. 


Pressing Shift-Tab opens the documentation panel. This is extremely 
useful and saves time as one does not have to open the official docs 
every single time. 


This feature also works for your custom functions. 


View a video version of this post on LinkedIn: Post Link. 




A No-code Tool To Understand Your 
Data Quickly 


from pandas_profiling import ProfileReport

profile = ProfileReport(iris_data,
                        title="Pandas Profiling Report")

profile.to_widgets()

[Figure: the generated report widget with tabs Overview, Variables, Interactions, Correlations, Missing values, Sample and Duplicate rows. The Alerts tab lists 7 alerts: 1 (0.7%) duplicate row, high correlation among the sepal/petal measurements and the target, and a uniformly distributed target.]
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The preliminary steps of any typical EDA task are often the same. Yet, 
across projects, we tend to write the same code to carry out these tasks. 
This gets repetitive and time-consuming. 


Instead, use pandas-profiling. It automatically generates a 
standardized report for data understanding in no time. Its intuitive UI 
makes this effortless and quick. 


The report includes the dimension of the data, missing value stats, and 
column data types. What's more, it also shows the data distribution, the 
interaction and correlation between variables, etc. 


Lastly, the report also includes alerts, which can be extremely useful 
during analysis/modeling. 


Read more: Documentation. 
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Why 256 is 256 But 257 is not 257? 


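[Code screenshot: an IPython session that compares a is b for the values 256 and 257, assigned on separate lines and on the same line.]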


Comparing python objects can be tricky at times. Can you figure out 
what is going on in the above code example? Answer below: 


When we run Python (CPython, to be precise), it pre-loads a global
list of integers in the range [-5, 256]. Every time an integer in this
range is referenced, Python does not create a new object. Instead, it
uses the cached version.


This is done for optimization purposes. It was considered that these 
numbers are used a lot by programmers. Therefore, it would make 
sense to have them ready at startup. 
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However, referencing any integer beyond 256 (or before -5) will create 
a new object every time. 


In the last example, when a and b are set to 257 in the same line, the 
Python interpreter creates a new object. Then it references the second 
variable with the same object. 
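

You can observe the cache with a small sketch. The integers are built with int("...") at run time so that the compiler cannot merge identical constants, which would hide the effect. The identity results are a CPython implementation detail and may differ on other interpreters, so never rely on is to compare numbers; use == instead.

```python
# int("...") builds the value at run time, so the compiler cannot merge constants.
a = int("256")
b = int("256")
print(a is b)    # True on CPython: both names point at the cached 256

c = int("257")
d = int("257")
print(c == d)    # True: same value
print(c is d)    # typically False on CPython: two separate objects
```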




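The image below should give you a better understanding:


[Figure: when a and b are assigned 257 on separate lines, they refer to different objects and print(a is b) outputs False. When they are assigned on the same line, they refer to the same object and print(a is b) outputs True.]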
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Make a Class Object Behave Like a 
Function 


class Quadratic:
    def __init__(self, a, b, c):
        ...

    def __call__(self, x):
        return (self.a * x**2) + ...


f = Quadratic(...)

print(f(...))          # Output: ...

print(callable(f))     # Output: True


If you want to make a class object callable, i.e., behave like a function, 
you can do so by defining the __call__ method. 


This method allows you to define the behavior of the object when it is 
invoked like a function. 
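

A complete version of the idea might look like this. The coefficients 1, 2 and 5 are arbitrary choices for illustration. The last line passes the object to map(), which expects a callable.

```python
class Quadratic:
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c

    def __call__(self, x):
        # Invoked when the object itself is called: f(x).
        return self.a * x**2 + self.b * x + self.c

f = Quadratic(1, 2, 5)        # hypothetical coefficients
print(f(2))                   # 1*4 + 2*2 + 5 = 13
print(callable(f))            # True
print(list(map(f, [0, 1])))   # [5, 8]
```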
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This can have many advantages. For instance, it allows us to 
implement objects that can be used in a flexible and intuitive way. 
What's more, the familiar function-call syntax, at times, can make your 
code more readable. 


Lastly, it allows you to use a class object in contexts where a callable 
is expected. Using a class as a decorator, for instance. 
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Lesser-known feature of Pickle Files 


dump.py

import pickle

with open("data.pkl", "wb") as f:
    pickle.dump(a, f)
    pickle.dump(b, f)
    pickle.dump(c, f)


load.py

import pickle

with open("data.pkl", "rb") as f:
    a = pickle.load(f)
    b = pickle.load(f)

print(f"{a = } {b = }")


Pickles are widely used to dump data objects to disk. But folks often 
dump just a single object into a pickle file. Moreover, one creates 
multiple pickles to store multiple objects. 


However, did you know that you can store as many objects as you want 
within a single pickle file? What's more, when reloading, it is not 
necessary to load all the objects. 


Just make sure to dump the objects to the same open file handle, for
instance, within the same context manager (using with).
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Of course, one solution is to store the objects together as a tuple. But 
while reloading, the entire tuple will be loaded. This may not be 
desired in some cases. 
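

The sketch below demonstrates this with an in-memory buffer standing in for the file on disk (io.BytesIO is used only to keep the example self-contained; the three objects are made up). Each pickle.load() call returns the next object in the order it was dumped, so you can stop after the ones you need.

```python
import io
import pickle

buffer = io.BytesIO()          # stands in for open("data.pkl", "wb")
for obj in ([1, 2, 3], {"k": "v"}, "third"):
    pickle.dump(obj, buffer)   # each dump appends one pickle to the stream

buffer.seek(0)                 # equivalent to reopening the file for reading
a = pickle.load(buffer)        # loads return objects in dump order
b = pickle.load(buffer)
# The third object is never loaded.

print(a, b)                    # [1, 2, 3] {'k': 'v'}
```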
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Dot Plot: A Potential Alternative to 
Bar Plot 


Dot_vs_Bar.py

px.bar(df, x="Year",
       y="Population",
       color="Country")

px.scatter(df, x="Population",
           y="Year",
           color="Country")

[Figure: a bar plot and a dot plot of Population by Year for Country A and Country B.]
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Bar plots are extremely useful for visualizing categorical variables 
against a continuous value. But when you have many categories to 
depict, they can get too dense to interpret. 


In a bar plot with many bars, we’re often not paying attention to the 
individual bar lengths. Instead, we mostly consider the individual 
endpoints that denote the total value. 


A dot plot can be a better choice in such cases. It is like a scatter
plot but with one categorical and one continuous axis.


Compared to a bar plot, they are less cluttered and offer better 
comprehension. This is especially true in cases where we have many 
categories and/or multiple categorical columns to depict in a plot. 


Read more: Documentation. 
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Why Correlation (and Other 
Statistics) Can Be Misleading. 


[Figure: two scatter plots with regression lines. Without outliers (without_outliers.py), df.corr() reports a correlation of 0.89 between value and output. With two outliers added (with_outliers.py), the correlation drops to 0.08.]


Correlation is often used to determine the association between two 
continuous variables. But it has a major flaw that often gets unnoticed. 


Folks often draw conclusions using a correlation matrix without even 
looking at the data. However, the obtained statistics could be heavily 
driven by outliers or other artifacts. 


This is demonstrated in the plots above. The addition of just two 
outliers changed the correlation and the regression line drastically. 


Thus, looking at the data and understanding its underlying
characteristics can save you from drawing wrong conclusions. Statistics
are important, but they can be highly misleading at times.
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Supercharge value_counts() Method 
in Pandas With Sidetable 


sidetable.py

import sidetable

df.stb.freq(["City"])


City count percent cumulative_count cumulative_percent 

West Jamesview 120 12.00% 120 12.00% 
Aliciafort 113 11.30% 233 23.30% 
Ricardomouth 106 10.60% 339 33.90% 
New Cindychester 106 10.60% 445 44.50% 
Whiteside 104 10.40% 549 54.90% 
Kristaburgh 97 9.70% 646 64.60% 
Wardfort 96 9.60% 742 74.20% 

New Russellton 93 9.30% 835 83.50% 
Whitakerbury 87 8.70% 922 92.20% 
North Melissafurt 78 7.80% 1,000 100.00% 
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The value_counts() method is commonly used to analyze categorical 
columns, but it has many limitations. 


For instance, if one wants to view the percentage, cumulative count, 
etc., in one place, things do get a bit tedious. This requires more code 
and is time-consuming. 


Instead, use sidetable. Consider it as a supercharged version of
value_counts(). As shown above, the freq() method from sidetable
provides a more useful summary than value_counts().


Additionally, sidetable can aggregate multiple columns too. You can 
also provide threshold points to merge data into a single bucket. What's 
more, it can print missing data stats, pretty print values, etc. 


Read more: GitHub. 
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Write Your Own Flavor Of Pandas 


1. Define method 

my_pandas.py 

    import pandas as pd 
    import pandas_flavor as pf 

    @pf.register_dataframe_method 
    def add_row(df, row): 
        df.loc[len(df)] = row 


2. Import module 

project.py 

    import my_pandas 

    df 
       Planets  Position 
    0  Mercury         1 
    1    Venus         2 

    new_row = ["Earth", 3] 
    df.add_row(new_row) 


3. "add_row" attached to df 


If you want to attach custom functionality to a Pandas DataFrame (or 
Series) object, use "pandas-flavor". 


Its decorators allow you to add methods directly to the Pandas object. 


This is especially useful if you are building an open-source project 
involving Pandas. After installing your library, others can access your 
library's methods using the dataframe object. 


P.S. This is how we see df.progress_apply() from tqdm, 
df.parallel_apply() from Pandarallel, and many more. 
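
If you would rather avoid an extra dependency, Pandas itself ships a related mechanism: accessor registration. The sketch below uses that built-in API instead of pandas-flavor, so the method lives under a namespace (df.rows.add) rather than directly on the DataFrame. The accessor name "rows" and the add() method are hypothetical names picked for this example.

```python
import pandas as pd

# "rows" is a hypothetical accessor name chosen for this sketch.
@pd.api.extensions.register_dataframe_accessor("rows")
class RowAccessor:
    def __init__(self, pandas_obj):
        self._df = pandas_obj

    def add(self, row):
        # Return a new DataFrame with `row` appended at the next integer label.
        out = self._df.copy()
        out.loc[len(out)] = row
        return out

df = pd.DataFrame({"Planets": ["Mercury", "Venus"], "Position": [1, 2]})
df = df.rows.add(["Earth", 3])
print(df)
```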


Read more: Documentation. 
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CodeSquire: The AI Coding Assistant 
You Should Use Over GitHub Copilot 


CodeSquire.ai 

    from catboost.datasets import titanic 
    import numpy as np 
    import pandas as pd 
    from catboost import CatBoostClassifier 

    train_df, test_df = titanic() 
    train_df.head() 

    # one hot encode all the categorical vars in train_df and test_df    <-- Write comment 
    train_df = pd.get_dummies(train_df, columns=['Sex', 'Embarked']) 
    test_df = pd.get_dummies(test_df, columns=['Sex', 'Embarked']) 
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Coding Assistants like GitHub Copilot are revolutionary as they offer 
many advantages. Yet, Copilot has limited utility for data 
professionals. This is because it's incompatible with web-based IDEs 
(Jupyter/Colab). 


Moreover, in data science, the subsequent exploratory steps are 
determined by previous outputs. But Copilot does not consider those 
outputs (or even markdown cells) to drive its code suggestions. 


CodeSquire is an incredible AI coding assistant that addresses the 
limitations of Copilot. The good thing is that it has been designed 
specifically for data scientists, engineers, and analysts. 


Besides seamless code generation, it can generate SQL queries from 
text and explain code. You can leverage AI-driven code generation by 
simply installing a browser extension. 


Read more: CodeSquire. 
Watch a video version of this post on LinkedIn: Post Link. 
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Vectorization Does Not Always 
Guarantee Better Performance 


Name column of df (sample): Beth Alvarez, Deborah Watkins, Jeffrey Compton, 
Alan Wolfe, Kathryn Gordon 


Vectorized.py 

    df.Name.str.split() 


Non-vectorized.py 

    def name_split(s): 
        return s.split() 

    df.Name.apply(name_split) 
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Vectorization is well-adopted for improving run-time performance. In a 
nutshell, it lets you operate on data in batches instead of processing a 
single value at a time. 


Although vectorization is extremely effective, you should know that it 
does not always guarantee performance gains. Moreover, vectorization 
is also associated with memory overheads. 


As demonstrated above, the non-vectorized code provides better 
performance than the vectorized version. 
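
If you want to check this on your own machine, a small timing sketch like the one below should do. The data is made up, and the actual numbers will depend on your data size and Pandas version, so treat it as a way to measure rather than as a benchmark result.

```python
import timeit
import pandas as pd

# Hypothetical data: 100,000 two-word names.
df = pd.DataFrame({"Name": ["Beth Alvarez", "Alan Wolfe"] * 50_000})

def name_split(s):
    return s.split()

vectorized = df.Name.str.split()
non_vectorized = df.Name.apply(name_split)

# Both approaches produce the same values; only the run-time differs.
same = [list(v) for v in vectorized] == [list(v) for v in non_vectorized]

t_vec = timeit.timeit(lambda: df.Name.str.split(), number=5)
t_apply = timeit.timeit(lambda: df.Name.apply(name_split), number=5)
print(f"str.split: {t_vec:.3f}s, apply: {t_apply:.3f}s")
```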


P.S. apply() is also a for-loop. 


Further reading: Here. 
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In Defense of Match-case Statements 
in Python 


if-else.py 

    def make_point(point): 
        if isinstance(point, (tuple, list)): 
            if len(point) == 2: 
                x, y = point 
                return Point3D(x, y, 0) 

            elif len(point) == 3: 
                x, y, z = point 
                return Point3D(x, y, z) 

            else: 
                raise TypeError("Unsupported") 
        else: 
            raise TypeError("Unsupported") 


match-case.py 

    def make_point(point): 
        match point: 
            case (x, y): 
                return Point3D(x, y, 0) 

            case (x, y, z): 
                return Point3D(x, y, z) 

            case _:    ## Default 
                raise TypeError("Unsupported") 


>>> make_point( (1, 
Point3D(x=1, y=2 


>>> make_point([1, 
Point3D(x=1, y=2, z=0) 


>>> make_point((1, 2, 3, 4)) 
TypeError: Unsupported 
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I recently came across a post on match-case in Python. In a gist, 
starting with Python 3.10, you can use match-case statements to mimic the 
behavior of if-else. 


Many responses on that post suggested that if-else offers higher 
elegance and readability. Here's an example in defense of match-case. 


While if-else is traditionally accepted, it also comes with many 
downsides. For instance, one often has to write complex 
chains of nested if-else statements. This includes multiple calls to 
functions such as len() and isinstance(). 


Furthermore, with if-else, one has to explicitly destructure the data to 
extract values. This makes your code inelegant and messy. 


Match-case, on the other hand, offers Structural Pattern Matching 
which makes this simple and concise. In the example above, match- 
case automatically handles type-matching, length check, and variable 
unpacking. 


Read more here: Python Docs. 
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Enrich Your Notebook With 
Interactive Controls 


Add Drop-down 

    from ipywidgets import interact 

    @interact 
    def plot_company(company = list(df.Company.unique())): 

        ## Filter the DataFrame 
        df_filtered = df[df.Company == company] 

        df_filtered.plot(...) 


[Figure: a "company" dropdown (select company from dropdown) listing White, 
Mcclain and Cobb; Scott Inc; Andrade LLC; James and Sons; Matthews Inc; 
Marshall-Holloway; Campos, Reynolds and Mccormick; Baker, Allen and Edwards; 
Nelson-Li; Thomas-Spencer; Johnston, Fleming and Tanner; Nichols-James; 
Taylor-Ramos; Bullock-Carrillo; Wallace, Smith and Shepard. Below it, a scatter 
plot of Employee_Rating against Employee_Salary, colored by Employment_Status 
(Full Time, Intern).] 


While using Jupyter, we often re-run the same cell repeatedly after 
changing the input slightly. This is time-consuming and also makes 
your data exploration tasks tedious and unorganized. 
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Instead, pivot towards building interactive controls in your notebook. 
This allows you to alter the inputs without needing to rewrite and re- 
run your code. 


In Jupyter, you can do this using the ipywidgets module. Embedding 
interactive controls is as simple as using a decorator. 


As a result, it provides you with interactive controls such as 
dropdowns and sliders. This saves you from tons of repetitive coding 
and makes your notebook organized. 


Watch a video version of this post on LinkedIn: Post Link. 
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Get Notified When Jupyter Cell Has 
Executed 


notebook.ipynb 

    In [1]: %load_ext jupyternotify 

    In [2]: %%notify 


[Figure: browser notification from Jupyter Notebook (localhost:8888): 
"Cell execution has finished!"] 


After running some code in a Jupyter cell, we often navigate away to 
do some other work in the meantime. 


Here, one has to repeatedly get back to the Jupyter tab to check 
whether the cell has been executed or not. 


To avoid this, you can use the %%notify magic command from the 
jupyternotify extension. As the name suggests, it notifies the user 
upon completion (both successful and unsuccessful) of a Jupyter cell 
via a browser notification. Clicking on the notification takes you back 
to the Jupyter tab. 


Read more: GitHub. 
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Data Analysis Using No-Code Pandas 
In Jupyter 


    In [1]: import mitosheet 
            mitosheet.sheet(analysis_to_replay="id-ymyxvhaoces") 

    executed in 4.97s, finished 14:48:34 2022-11-22 


[Figure: the Mito spreadsheet inside Jupyter, with a toolbar (Undo, Redo, Clear, 
Import, Export, Add Col, Del Col, Dtype, Less, More, Number, Pivot, Graph, Steps, 
Fullscreen) above the employee_dataset sheet (100 rows, 6 cols).] 

Name (str)         Company_Name (str)     Employee_City (str)  Employee_Salary (float)  Employment_Status (str)  Employee_Rating (float) 
Christopher Jones  Matthews Inc           Aliciafort                          5,187.04  Full Time                                   1.50 
Mitchell Hill      Baker, Allen and Edw   Aliciafort                          4,078.71  Full Time                                   1.40 
Dawn Bailey        White, Mcclain and C   Aliciafort                         11,379.39  Full Time                                   4.50 
Donald Bowman      Scott Inc              Aliciafort                          4,292.43  Full Time                                   1.30 
Kelly Liu          Matthews Inc           Aliciafort                          4,413.51  Intern                                      0.90 
David Mills        Johnston, Fleming an   Aliciafort                          6,917.60  Intern                                      1.90 
Vanessa Lamb       Taylor-Ramos           Aliciafort                          8,391.54  Full Time                                   2.70 
Douglas Kennedy    Andrade LLC            Aliciafort                          2,815.63  Full Time                                   0.80 
Jeffrey Gonzalez   Taylor-Ramos           Aliciafort                         10,401.33  Full Time                                   4.60 
Emily Weber        Matthews Inc           Kristaburgh                         7,676.39  Intern                                      2.30 
Zachary Ellison    James and Sons         Kristaburgh                         7,194.54  Full Time                                   2.60 
Gina Acosta        Nichols-James          Kristaburgh                         7,239.98  Full Time                                   2.10 
Jason Reyes        Matthews Inc           Kristaburgh                         6,760.68  Full Time                                   2.80 
James Wright       Nelson-Li              Kristaburgh                         3,980.27  Intern                                      1.00 
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The Pandas API provides a wide range of functionalities to analyze 
tabular datasets. 


Yet, across projects, we often use the same methods over and over to 
analyze our data. This quickly gets repetitive and time-consuming. 


To avoid this, use Mito. It's an incredible tool that allows you to 
analyze your data within a spreadsheet interface in Jupyter, without 
writing any code. 


The coolest thing about Mito is that each edit in the spreadsheet 
automatically generates the equivalent Python code. This makes it 
extremely convenient to reproduce the analysis later. 


Read more: Documentation. 
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Using Dictionaries In Place of If- 
conditions 


if_else.py 

    number = int(input()) 

    if number == 1: 
        func1() 

    elif number == 2: 
        func2() 

    else: 
        func3() 


dict.py 

    number = int(input()) 

    func_map = {1: func1, 
                2: func2} 

    func_map.get(number, func3)() 


Dictionaries are mainly used as a data structure in Python for 
maintaining key-value pairs. 


However, there's another special use case that dictionaries can handle: 
eliminating if conditions from your code. 


Consider the code snippet above. Here, corresponding to an input 
value, we invoke a specific function. The traditional way requires you 
to hard-code every case. 
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But with a dictionary, you can directly retrieve the corresponding 
function by providing it with the key. This makes your code concise 
and elegant. 
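
Here is a runnable sketch of the same idea. The three functions are placeholders (the original snippet does not show their bodies), and the input is passed as an argument instead of being read with input() so the example runs on its own.

```python
def func1():
    return "one"

def func2():
    return "two"

def func3():
    return "default"

# Map each expected input to the function that handles it.
func_map = {1: func1, 2: func2}

def dispatch(number):
    # Fall back to func3 when the key is missing, like the final else branch.
    return func_map.get(number, func3)()

print(dispatch(1), dispatch(2), dispatch(99))
```

Adding a new case is now a one-line change to the dictionary, with no new branch to write.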


avichawla.substack.com 


Clear Cell Output In Jupyter 
Notebook During Run-time 


    import time 
    from IPython.display import clear_output 

    for i in range(100): 

        ## Wait for the next 
        ## output before clearing 
        clear_output(wait=True) 

        print(f'Output Number {i+1}') 
        time.sleep(1) 

    executed in 1m 40.6s, finished 15:55:44 2022-11-19 

    Output Number 100    <-- Only Last Output 
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While using Jupyter, we often print many details to track the code's 
progress. 


However, it gets frustrating when the output panel has accumulated a 
bunch of details, but we are only interested in the most recent output. 
Moreover, scrolling to the bottom of the output each time can be 
annoying too. 


To clear the output of the cell, you can use the clear_output() function 
from IPython.display. When invoked, it removes the current 
output of the cell, after which you can print the latest details. With 
wait=True, it waits for the next output before clearing, which avoids flicker. 
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A Hidden Feature of Describe Method 
In Pandas 


df has four columns: col1, col2, col3, col4 


Only Numerical Columns 

    df.describe() 


All Columns 

    df.describe(include="all") 
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The describe() method in Pandas is commonly used to print 
descriptive statistics about the data. 


But have you ever noticed that, by default, its output is limited to numerical 
columns? Of course, details like mean, median, std. dev., etc. hold no 
meaning for non-numeric columns, so the results make total sense. 


However, describe() can also provide a quick summary of non-numeric 
columns. You can do this by specifying include="all". As a 
result, it also returns the number of unique elements and the top element 
with its frequency. 
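
A quick way to see the difference is to run both calls on a small mixed-type DataFrame. The data below is made up for illustration.

```python
import pandas as pd

# Hypothetical mixed-type DataFrame.
df = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0],
    "col2": ["a", "b", "a"],
})

numeric_only = df.describe()
everything = df.describe(include="all")

print(numeric_only.columns.tolist())
print(everything.loc[["unique", "top", "freq"], "col2"])
```

The numeric statistics show up as NaN for col2, and the unique/top/freq rows show up as NaN for col1, so each column only fills in the rows that make sense for its type.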


Read more: Documentation. 
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Use Slotted Class To Improve Your 
Python Code 


Without_slots.py 

    class Person: 
        def __init__(self, name, age): 
            self.Name = name 
            self.Age = age 

    person = Person('Mike', 22) 

    person.name = 'Peter' 
    ## No Error 


With_slots.py 

    class Person: 
        __slots__ = ['Name', 'Age'] 

        def __init__(self, name, age): 
            self.Name = name 
            self.Age = age 

    person = Person('Mike', 22) 

    person.name = 'Peter' 
    ## AttributeError 


If you want to fix the attributes a class can hold, consider defining it as 
a slotted class. 


While defining classes, __slots__ allows you to explicitly specify the 
attributes an instance can hold. This means you cannot add arbitrary new 
attributes to a slotted class object. This offers many advantages. 


For instance, slotted classes are memory efficient and they provide 
faster attribute access. What's more, they help you catch typos in 
attribute names. Without slots, such a typo silently creates a new 
attribute, which can be a costly mistake that goes unnoticed. 
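
The sketch below shows the typo protection in action. The misspelled attribute name is deliberate, and the class is a stripped-down, hypothetical version of the one in the snippet above.

```python
class Person:
    __slots__ = ("name", "age")

    def __init__(self, name, age):
        self.name = name
        self.age = age

person = Person("Mike", 22)
person.name = "Peter"       # allowed: "name" is a declared slot

try:
    person.nmae = "Peter"   # typo: "nmae" is not a declared slot
except AttributeError as err:
    typo_error = err
    print(f"AttributeError: {err}")

# Slotted instances carry no per-instance __dict__, which is where
# the memory savings come from.
has_dict = hasattr(person, "__dict__")
```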


Read more: StackOverflow. 
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Stop Analysing Raw Tables. Use 
Styling Instead! 


df_bar.py 

    data.style.bar(color='lightgreen', 
                   subset=['Count']) 


df_gradient.py 

    data.style.background_gradient(cmap='Blues', 
                                   subset=['Count']) 


[Figure: the same Currency/Count table rendered twice, once with in-cell bars 
in the Count column and once with a blue background gradient.] 
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Jupyter is a web-based IDE. Thus, whenever you print/display a 
DataFrame in Jupyter, it is rendered using HTML and CSS. 


This means you can style your output in many different ways. 


To do so, use the Styling API of Pandas. Here, you can make many 
different modifications to a DataFrame's styler object (df.style). As a 
result, the DataFrame will be displayed with the specified styling. 


Styling makes these tables visually appealing. Moreover, it allows for 
better comprehensibility of data than viewing raw tables. 


Read more here: Documentation. 
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Explore CSV Data Right From The 
Terminal 


Excel to CSV 

    $ in2csv data.xlsx > data.csv 

    data.csv 
    Name   Marks  Grade 
    Joe    95     A 
    Hanna  89     B 
    ean    24     a 
    Julie  94     A 


Column Names 

    $ csvcut -n data.csv 
      1: Name 
      2: Marks 
      3: Grade 


Column Stats 

    $ csvstat data.csv 
      2. "Marks" 
            Type of data: Number 
            Contains null values: False 
            Unique values: 
            Smallest value: 
            Largest value: 
            Sum: 
            Mean: 
            Median: 
            StDev: 


Query 

    $ csvsql --query "select * from data where Marks>90" data.csv 
    Name | Marks | Grade | 


If you want to quickly explore some CSV data, you may not always 
need to run a Jupyter session. 


Rather, with "csvkit", you can do it from the terminal itself. As the 
name suggests, it provides a bunch of command-line tools to facilitate 
data analysis tasks. 


These include converting Excel to CSV, viewing column names, data 
statistics, and querying using SQL. Moreover, you can also perform 
popular Pandas functions such as sorting, merging, and slicing. 


Read more: Documentation. 
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Generate Your Own Fake Data In 
Seconds 


fake_data.py 

    from faker import Faker 
    fake = Faker() 

    fake.name() 
    'Darrell Alexander' 

    fake.email() 
    'ryanrichard@example.com' 

    fake.address() 
    '205 Brown Point, West Melissaport, MN 93828' 

    fake.company() 
    'Lam, Thomas and Cooper' 

    fake.date_of_birth() 
    datetime.date(1973, 1, 21) 

    fake.color_name() 
    'LightBlue' 
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Usually, for executing/testing a pipeline, we need to provide it with 
some dummy data. 


Using Python's "random" library, one can generate random 
strings, floats, and integers. Yet, being random, it does not output any 
meaningful data such as people's names, city names, emails, etc. 


Here, looking for open-source datasets can get time-consuming. 
Moreover, it's possible that the dataset you find does not fit your 
requirements well. 


The Faker module in Python is a perfect solution to this. Faker allows 
you to generate highly customized fake (yet meaningful) data quickly. 
What's more, you can also generate data specific to a demographic. 


Read more here: Documentation. 
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Import Your Python Package as a 
Module 


directory_old 

    project 
    ├── model 
    │   ├── train.py 
    │   └── test.py 
    └── pipeline.py 

    pipeline.py (Redundant Imports) 

        from model.train import Training 
        from model.test import Testing 


directory_new (Add __init__.py) 

    project 
    ├── model 
    │   ├── __init__.py 
    │   ├── train.py 
    │   └── test.py 
    └── pipeline.py 

    __init__.py 

        from .train import Training 
        from .test import Testing 

    pipeline.py (Import from Package like a Module) 

        from model import Training, Testing 
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A python module is a single python file (.py). An organized collection 
of such python files is called a python package. 


While developing large projects, it is a good practice to define an 
__init__.py file inside a package. 


Consider train.py has a Training class and test.py has a Testing 
class. 


Without __init__.py, one has to explicitly import them from their specific 
Python files. As a result, the calling script needs one import statement 
per file, which gets redundant. 


With __init__.py, you can group Python files into a single importable 
module. In other words, it provides a mechanism to treat the whole 
package as a Python module. 


This saves you from writing redundant import statements and makes 
your code cleaner in the calling script. 
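
If you want to try this without touching a real project, the sketch below builds a throwaway package in a temporary directory and imports from it. The package name demo_model is hypothetical (chosen to avoid clashing with anything installed), and the classes are empty stand-ins.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk; "demo_model" is a hypothetical name.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_model"
pkg.mkdir()
(pkg / "train.py").write_text("class Training:\n    pass\n")
(pkg / "test.py").write_text("class Testing:\n    pass\n")

# __init__.py re-exports both classes at the package level.
(pkg / "__init__.py").write_text(
    "from .train import Training\n"
    "from .test import Testing\n"
)

sys.path.insert(0, str(root))
demo_model = importlib.import_module("demo_model")

# One import target now exposes both classes.
print(demo_model.Training, demo_model.Testing)
```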


Read more in this blog: Blog Link. 
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Specify Loops and Runs In %%timeit 


notebook.ipynb 

    %%timeit -n 1000 -r 4 
    time.sleep(2) 

    ## -n: number of loops (1000) 
    ## -r: number of runs (4) 
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We commonly use the %timeit (or %%timeit) magic command to 
measure the execution time of our code. 


Here, timeit limits the number of runs depending on how long the 
script takes to execute. This is why you get to see a different number 
of loops (and runs) across different pieces of code. 


However, if you want to explicitly define the number of loops and 
runs, use the -n and -r options. Use -n to specify the loops and -r for 
the number of runs. 
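
Outside a notebook, the standard-library timeit module exposes the same two knobs. This is a sketch of the plain-Python counterpart, not of the magic command itself; the timed statement is an arbitrary placeholder.

```python
import timeit

# number ~ loops per run (the -n option)
# repeat ~ how many runs to make (the -r option)
timings = timeit.repeat("sum(range(100))", number=1000, repeat=4)

# Each entry is the total time for one run of 1000 loops.
best_per_loop = min(timings) / 1000
print(len(timings), f"{best_per_loop:.2e} s per loop")
```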
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Waterfall Charts: A Better Alternative 
to Line/Bar Plot 


Waterfall.py 

    import waterfall_chart 

    waterfall_chart.plot(df.Month, 
                         df.Delta) 


If you want to visualize a value over some period, a line (or bar) plot 
may not always be an apt choice. 


A line-plot (or bar-plot) depicts the actual values in the chart. Thus, 
sometimes, it can get difficult to visually estimate the scale of 
incremental changes. 


Instead, you can use a waterfall chart, which elegantly depicts these 
rolling differences. 


To create one, you can use waterfall_chart in Python. Here, the start 
and final values are represented by the first and last bars. Also, the 
marginal changes are automatically color-coded, making them easier to 
interpret. 


Read more here: GitHub. 
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Hexbin Plots As A Richer Alternative 
to Scatter Plots 


scatter.py 

    df.plot(kind='scatter', 
            x='x', 
            y='y') 


hexbin.py 

    df.plot(kind='hexbin', 
            x='x', 
            y='y') 
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Scatter plots are extremely useful for visualizing two sets of numerical 
variables. But when you have, say, thousands of data points, scatter 
plots can get too dense to interpret. 


Hexbins can be a good choice in such cases. As the name suggests, 
they bin the area of a chart into hexagonal regions. Each region is 
assigned a color intensity based on the method of aggregation used (the 
number of points, for instance). 


Hexbins are especially useful for understanding the spread of data. They 
are often considered an elegant alternative to a scatter plot. Moreover, 
binning makes it easier to identify data clusters and depict patterns. 
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Importing Modules Made Easy with 
Pyforest 


    import pandas as pd 
    import numpy as np 
    import matplotlib.pyplot as plt 
    import seaborn as sns 
    import sys 
    from sklearn.linear_model import LinearRegression 


    from pyforest import * 

    pd.read_csv("file.csv") 
    np.array([1, 2, 3]) 
    sys.path 
    LinearRegression() 


The typical programming-related stuff in data science begins by 
importing relevant modules. 


However, across notebooks/projects, the modules one imports are 
mostly the same. Thus, the task of importing all the individual libraries 
is kinda repetitive. 


With pyforest, you can use the common Python libraries without 
explicitly importing them. A good thing is that it imports all the 
libraries with their standard conventions. For instance, pandas is 
imported with the pd alias. 
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With that, you should also note that it is a good practice to keep 
Pyforest limited to prototyping stages. This is because once you, say, 
develop and open-source your pipeline, other users may face some 
difficulties understanding it. 


But if you are up for some casual experimentation, why not use it 
instead of manually writing all the imports? 


Read more: GitHub. 
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Analyse Flow Data With Sankey 
Diagrams 


Sankey.py 

    from ipysankeywidget import SankeyWidget 

    df 
    source      target 
    New Delhi   Washington 
    Melbourne   Washington 
    London      New York 
    Dubai       Los Angeles 
    Dubai       Washington 
    London      Washington 
    Melbourne   Los Angeles 

    SankeyWidget(links = df.to_dict("records")) 


[Figure: Sankey diagram with flows from each source city into Los Angeles, 
Washington, and New York.] 
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Many tabular data analysis tasks can be interpreted as a flow between 
the source and a target. 


Here, manually analyzing tabular reports/data to draw insights is 
typically not the right approach. 


Instead, Flow diagrams serve as a great alternative in such cases. 
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Being visually appealing, they immensely assist you in drawing crucial 
insights from your data, which you may find challenging to infer by 
looking at the data manually. 


For instance, from the diagram above, one can quickly infer that: 
1. Washington hosts flights from all origins. 

2. New York only receives passengers from London. 

3. The majority of flights into Los Angeles come from Dubai. 

4. All flights from New Delhi go to Washington. 


Now imagine doing that by just looking at the tabular data. Not only 
will it be time-consuming, but there are chances that you may miss out 
on a few insights. 


To generate a flow diagram, you can use floWeaver. It helps you to 
visualize flow data using Sankey diagrams. 


Read more here: Documentation. 
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Feature Tracking Made Simple In 
Sklearn Transformers 


numpy_output.py 

    from sklearn.preprocessing import PolynomialFeatures 

    df 
       colA  colB 
    0     1     2 
    1     3     4 
    2     5     6 

    PolynomialFeatures().fit_transform(df) 

    array([[ 1.,  1.,  2.,  1.,  2.,  4.], 
           [ 1.,  3.,  4.,  9., 12., 16.], 
           [ 1.,  5.,  6., 25., 30., 36.]]) 


pandas_output.py 

    from sklearn import set_config 

    set_config(transform_output = "pandas") 

    PolynomialFeatures().fit_transform(df) 

         1  colA  colB  colA^2  colA colB  colB^2 
    0  1.0   1.0   2.0     1.0        2.0     4.0 
    1  1.0   3.0   4.0     9.0       12.0    16.0 
    2  1.0   5.0   6.0    25.0       30.0    36.0 


Recently, scikit-learn announced the release of one of the most 
awaited improvements. In short, sklearn can now be configured to 
output Pandas DataFrames. 


Until now, Sklearn's transformers could accept a Pandas 
DataFrame as input. But they always returned a NumPy array as an 
output. As a result, the output had to be manually projected back to a 
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Pandas DataFrame. This, at times, made it difficult to track and assign 
names to the features. 


For instance, consider the snippet above. 


In numpy_output.py, it is tricky to infer the name (or computation) 
of a column by looking at the NumPy array. 


However, in the upcoming release, the transformer can return a Pandas 
DataFrame (pandas_output.py). This makes tracking feature names 
incredibly simple. 


Read more: Release page. 
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Lesser-known Feature of f-strings in 
Python 


    print(f"Count = {Count}") 
    print(f"Fruit = {Fruit}") 
    ## Count = 2 
    ## Fruit = Apple 


    print(f"{Count = }") 
    print(f"{Fruit = }") 
    ## Count = 2 
    ## Fruit = 'Apple' 


While debugging, one often explicitly prints the name of the variable 
with its value to enhance code inspection. 


Although there's nothing wrong with this approach, it makes your print 
statements messy and lengthy. 


f-strings in Python offer an elegant solution for this. 


To print the name of the variable, you can add an equals sign (=) in the 
curly braces after the variable (Python 3.8+). This prints the name of the 
variable along with its value while keeping the statement concise and clean. 
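
Here is a small side-by-side you can paste into a Python 3.8+ session. The variable names are lowercase stand-ins for the ones in the snippet above.

```python
count = 2
fruit = "Apple"

verbose = f"count = {count}, fruit = {fruit}"
concise = f"{count = }, {fruit = }"

print(verbose)
print(concise)
```

Two details to keep in mind: the = form shows the repr() of the value, so strings appear with quotes, and any whitespace you put around the = is preserved in the output.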
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Don't Use time.time() To Measure 
Execution Time 


time.py 

    import time 

    start = time.time() 
    time.sleep(10) 
    end = time.time() 

    print(end - start) 
    ## 10.00482 


perf_counter.py 

    import time 

    start = time.perf_counter() 
    time.sleep(10) 
    end = time.perf_counter() 

    print(end - start) 
    ## 10.00435 


The time() method from the time library is frequently used to measure 
the execution time. 


However, time() is not meant for timing your code. Rather, its actual 
purpose is to tell the current time. Since it follows the system clock, 
which can be adjusted while your code runs, it can compromise the 
accuracy of the measured run time. 


The correct approach is to use perf_counter(), which deals with 
relative time on a monotonic, high-resolution clock. Thus, it is the 
recommended way to time your code. 
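
If you time code often, a tiny helper around perf_counter() keeps things tidy. The helper name timed() below is made up for this sketch.

```python
import time

def timed(func, *args):
    # Measure elapsed time with a monotonic, high-resolution clock.
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

result, elapsed = timed(sum, range(1_000_000))
print(f"result={result}, elapsed={elapsed:.4f}s")
```

Only the difference between two perf_counter() readings is meaningful; a single reading has no defined reference point.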
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Now You Can Use DALL:E With 
OpenAI API 


DALL-E API 

    import openai 
    openai.api_key = "Your-API-Key" 

    response = openai.Image.create( 
        prompt="The city of Paris on Mars.", 
    ) 

    image_url = response['data'][0]['url'] 


DALL-E is now accessible using the OpenAI API. 


OpenAI recently made a big announcement. In short, developers can 
now integrate OpenAI's popular text-to-image model DALL-E into 
their apps using the OpenAI API. 


To achieve this, first, specify your API key (obtained after signup). 
Next, pass a text prompt to generate the corresponding image. 
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Polynomial Linear Regression Plot 
Made Easy With Seaborn 


Polynomial Regression 

    import seaborn as sns 

    sns.lmplot(...) 
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While creating scatter plots, one is often interested in displaying a 
linear regression (simple or polynomial) fit on the data points. 


Here, training a model and manually embedding it in the plot can be a 
tedious job to do. 


Instead, with Seaborn's lmplot(), you can add a regression fit to a plot, 
without explicitly training a model. 


Specify the degree of the polynomial as the "order" parameter. 
Seaborn will add the corresponding regression fit on the scatter plot. 
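
To get a feel for what a fit of a given order means numerically, here is a plot-free sketch using numpy.polyfit. It is a stand-in for illustration, not Seaborn's own code path, and the data is made up and noiseless so the recovered coefficients are easy to check.

```python
import numpy as np

# Hypothetical noiseless data following y = 2x^2 + 3x + 1.
x = np.linspace(-5, 5, 50)
y = 2 * x**2 + 3 * x + 1

# deg=2 plays the role of order=2: fit a second-degree polynomial.
coeffs = np.polyfit(x, y, deg=2)
y_fit = np.polyval(coeffs, x)

print(np.round(coeffs, 3))
```

With order=1 (the default in lmplot) you would get a straight line instead, which would clearly underfit data like this.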


Read more here: Seaborn Docs. 
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Retrieve Previously Computed Output 
In Jupyter Notebook 


    In [3]: df.groupby("col1").col2.mean().reset_index() 

    Out[3]: 


    In [4]: Out[3] 

    Out[4]: 
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This is indeed one of the coolest things I have learned about Jupyter 
Notebooks recently. 


Have you ever been in a situation where you forgot to assign the 
results obtained after some computation to a variable? Left with no 
choice, one has to unwillingly recompute the result and assign it to a 
variable for further use. 


Thankfully, you don't have to do that anymore! 


IPython provides a dictionary "Out", which you can use to retrieve a 
cell's output. All you need to do is specify the cell number as the 
dictionary's key, which will return the corresponding output. Isn't that 
cool? 


View a video version of this post on LinkedIn: Post Link. 
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Parallelize Pandas Apply() With 
Swifter 


Pandas Apply 

    ## Shape: (10M, 4) 

    def sum_row(row): 
        return sum(row) 

    df.apply(sum_row, axis=1) 


Swifter Apply 

    import swifter 

    df.swifter.apply(sum_row, 
                     axis=1) 


The Pandas library has no inherent support to parallelize its operations. 
Thus, it always adheres to a single-core computation, even when other 
cores are idle. 


Things get even worse when we use apply(). In Pandas, apply() is 
nothing but a glorified for-loop. As a result, it cannot even take 
advantage of vectorization. 


A quick solution to parallelize apply() is to use swifter instead. 


Swifter allows you to apply any function to a Pandas DataFrame in a 
parallelized manner. As a result, it provides considerable performance 
gains while preserving the old syntax. All you have to do is use 
df.swifter.apply instead of df.apply. 


Read more here: Swifter Docs. 
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Create DataFrame Hassle-free By 
Using Clipboard 


Step 1: Copy Table 

[Figure: a table on a web page, with a column A containing rows such as 
"foo one" and "foo three", selected and copied.] 


Step 2: Read in Pandas 

Read Clipboard 

    import pandas as pd 

    df = pd.read_clipboard() 
    df.head() 


Many Pandas users think that a DataFrame can ONLY be loaded from 
disk. However, this is not true. 


Imagine one wants to create a DataFrame from tabular data printed on 
a website. Here, they are most likely to be tempted to copy the 
contents to a CSV and read it using Pandas' read_csv() method. But 
this is not an ideal approach here. 


Instead, with the read_clipboard() method, you can eliminate the 
CSV step altogether. 


This method allows you to create a DataFrame from tabular data stored 
in a clipboard buffer. Thus, you just need to copy the data and invoke 
the method to create a DataFrame. This is an elegant approach that 
saves plenty of time. 
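
Since a real clipboard is hard to demo in a script, the sketch below fakes the copied text with a string. read_clipboard() passes the clipboard text on to read_csv(), so parsing the same text with a whitespace separator gives a fair idea of what you get back. The table contents are made up.

```python
import io
import pandas as pd

# Stand-in for text copied from a web page (hypothetical data).
copied_text = "A B C\nfoo one 1\nfoo two 2\nbar one 3\n"

# read_clipboard() hands the clipboard text to read_csv();
# whitespace-separated parsing is emulated here with sep=r"\s+".
df = pd.read_csv(io.StringIO(copied_text), sep=r"\s+")
print(df)
```

If the copied table uses another delimiter, read_clipboard() accepts the usual read_csv() keyword arguments such as sep.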


Read more here: Pandas Docs. 
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Run Python Project Directory As A 
Script 


Directory (before) 

    dummy_project 
    ├── model.py 
    ├── app.py 
    ├── validate.py 
    └── preprocess.py 

Terminal 

    $ python dummy_project/app.py 

    Preprocessing Done! 
    Model Trained! 
    Results Validated! 


Directory (after renaming app.py to __main__.py) 

    dummy_project 
    ├── model.py 
    ├── __main__.py 
    ├── validate.py 
    └── preprocess.py 

Terminal 

    $ python dummy_project 

    Preprocessing Done! 
    Model Trained! 
    Results Validated! 


A Python script is executed when we run a .py file. In large projects 
with many files, there's often a source (or base) Python file we begin 
our program from. 


To make things simpler, you can instead rename this base file to 
__main__.py. As a result, you can execute the whole pipeline by 
running the parent directory itself. 


This is concise and also makes it slightly easier for other users to use 
your project. 
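
You can verify this behavior end to end with a throwaway directory. The sketch below creates one with only a __main__.py inside and then runs the directory with the current interpreter; the directory name and the printed message are placeholders.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Build a throwaway project directory; the names are hypothetical.
project = Path(tempfile.mkdtemp()) / "dummy_project"
project.mkdir()
(project / "__main__.py").write_text('print("Pipeline finished!")\n')

# Equivalent to running `python dummy_project` in a terminal.
completed = subprocess.run(
    [sys.executable, str(project)],
    capture_output=True,
    text=True,
    check=True,
)
output = completed.stdout.strip()
print(output)
```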
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Inspect Program Flow with IceCream 


file.py (with print)

def func():
    print(0)

    if condition:
        print(1)
    else:
        print(2)

Terminal

$ python file.py
0
2


file.py (with ic)

from icecream import ic

def func():
    ic()

    if condition:
        ic()
    else:
        ic()

Terminal

$ python file.py
ic| file.py:4 in func()
ic| file.py:10 in func()


While debugging, one often writes many print() statements to inspect
the program's flow. This is especially true when the code has many
if-else branches.


Using empty ic() statements from the IceCream library can be a better
alternative here. Each call prints details that help in inspecting
the flow of the program, without you having to invent a message.


These include the file name, the line number, and the name of the
enclosing function.


Read more in my Medium Blog: Link. 


avichawla.substack.com


Don't Create Conditional Columns in 
Pandas with Apply 


Apply

def assign_class(num):
    if num > 0.5:
        return "Class A"
    else:
        return "Class B"

df.col1.apply(assign_class)
## 987 ms ± 47.1 ms per loop


NumPy Where

import numpy as np

np.where(df["col1"] > 0.5,
         "Class A",
         "Class B")
## 94 ms ± 25.7 ms per loop


While creating conditional columns in Pandas, we tend to reach for
the apply() method almost all the time.


However, apply() on a column is essentially a Python-level for-loop
over its values. As a result, it misses the whole point of
vectorization.


Instead, you should use the np.where() method to create conditional
columns. It does the same job but is vectorized, which makes it much
faster (compare the timings in the snippet).


The condition is passed as the first argument. This is followed by the 
result if the condition evaluates to True (second argument) and False 
(third argument). 
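
A small sketch that puts the two approaches side by side on made-up data and checks that they agree. The column names class_apply and class_where are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [0.1, 0.7, 0.5, 0.9]})


def assign_class(num):
    return "Class A" if num > 0.5 else "Class B"


# Row-by-row version: calls the Python function once per value.
df["class_apply"] = df["col1"].apply(assign_class)

# Vectorized version: condition, value if True, value if False.
df["class_where"] = np.where(df["col1"] > 0.5, "Class A", "Class B")

print(df)
```

Note that 0.5 itself falls into "Class B" in both versions, because the condition is a strict greater-than.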


Read more here: NumPy docs. 
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Pretty Plotting With Pandas 


Plotly Backend

import pandas as pd

pd.options.plotting.backend = 'plotly'

df.plot(kind='scatter',
        x='col1',
        y='col2')


Matplotlib is the default plotting backend of Pandas. This means you
can create a Matplotlib plot from a DataFrame without importing
Matplotlib yourself.


Despite that, these plots have always been basic and not so visually
appealing. Plotly, with its pretty and interactive plots, is often
considered a suitable alternative. But familiarising yourself with a
whole new library and its syntax can be time-consuming.


Thankfully, Pandas allows you to change the default plotting
backend. Thus, you can leverage third-party visualization libraries
for plotting with Pandas. This makes it easy to create prettier plots
while keeping almost all of the familiar df.plot() syntax.
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Build Baseline Models Effortlessly 
With Sklearn 


dummy.py

from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X, y)

dummy_clf.predict(X)
array([0, 0, 0, 0, 0])

dummy_clf.score(X, y)


Before developing a complex ML model, it is always sensible to create
a baseline first.


The baseline serves as a benchmark for the engineered model.
Moreover, it lets you verify that the model actually beats random (or
fixed) predictions. But building baselines with various strategies
(random, fixed, most frequent, etc.) by hand can be tedious.


Instead, Sklearn's DummyClassifier() (and DummyRegressor()) makes
it straightforward. You can select the specific behavior of the
baseline with the strategy parameter.
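
A minimal sketch on made-up labels, assuming scikit-learn is installed. With strategy="most_frequent", the baseline ignores the features and always predicts the majority class, so its accuracy equals the share of that class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((5, 2))            # features are ignored by the dummy model
y = np.array([0, 0, 0, 1, 1])   # class 0 is the most frequent label

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X, y)

predictions = dummy_clf.predict(X)          # always the majority class
baseline_accuracy = dummy_clf.score(X, y)   # share of the majority class

print(predictions, baseline_accuracy)
```

Any real model trained on the same data should score above this number to be worth keeping.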


Read more here: Documentation. 
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Fine-grained Error Tracking With 
Python 3.11 


Error Tracking

$ python expt.py
Traceback (most recent call last):
  File "expt.py", line 11, in <module>
    print(function(a=2, b=0))
  File "expt.py", line 6, in function
    [source line illegible; markers point at the failing division]
ZeroDivisionError: division by zero


Python 3.11 was released today, and many exciting features have been
introduced.


For instance, various speed improvements have been implemented. As
per the official release, Python 3.11 is, on average, 25% faster than
Python 3.10. Depending on your workload, the speedup ranges from 10%
to 60%.


One of the coolest features is the fine-grained error locations in
tracebacks.


In Python 3.10 and before, the interpreter showed only the line that
caused the error. When that line contained several operations, it was
unclear which one had failed.


In Python 3.11, the traceback also marks the exact expression within
the line that caused the error. This helps immensely during
debugging.
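
A small sketch of the kind of line where this matters. The function and argument names are made up for illustration. Two divisions share one line, so a line number alone cannot say which one failed; when this is saved to a file and run on Python 3.11 or later, the printed traceback additionally marks the failing sub-expression.

```python
import traceback


def function(a, b, c):
    # Two divisions on one line: which one raised?
    return (a / b) + (b / c)


report = ""
try:
    function(a=2, b=1, c=0)   # a / b succeeds, b / c divides by zero
except ZeroDivisionError:
    # Capture the traceback text instead of letting the program crash.
    report = traceback.format_exc()

print(report)
```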


Read more here: Official Release. 
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Find Your Code Hiding In Some 
Jupyter Notebook With Ease 


Search in All Notebooks

Command Line

$ grep "polyfit" *.ipynb

numpy_lr.ipynb: np.polyfit(x, y, deg
numpy_lr.ipynb: np.polyfit(x, y, deg =


Programmers who use Jupyter often refer to their old notebooks to find
a piece of code.


However, it gets tedious when they have many files to look through
and can't recall the specific notebook of interest. File names such
as Untitled1.ipynb, ..., Untitled82.ipynb don't make it any easier.


The "grep" command is a much better solution to this. Very few know
that you can use "grep" in the command line to search in notebooks,
as you do in other files (.txt, for instance). This saves plenty of
manual work and time.


P.S. How do you find some previously written code in your notebooks 
(if not manually)? 
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Restart the Kernel Without Losing 
Variables 


Notebook.ipynb (before the restart)

[1]: value = ...

[2]: %store value


Notebook.ipynb (after the restart)

[1]: %store -r value

[2]: value


While working in a Jupyter Notebook, you may want to restart the
kernel for several reasons. But before restarting, one often dumps
data objects to disk to avoid recomputing them in the subsequent
run.


The "store" magic command serves as an ideal solution to this. Here,
you can obtain a previously computed value even after restarting your
kernel. What's more, you never need to write the dump and load code
yourself: %store pickles the variable into IPython's own storage, and
%store -r reads it back.




How to Read Multiple CSV Files 
Efficiently 


Pandas_read.py

import pandas as pd

files = ["jan.csv", "feb.csv",
         "mar.csv", "apr.csv",
         "may.csv", "jun.csv"]
## 300 MBs each

df_list = []

for i in files:
    df_list.append(pd.read_csv(i))

data = pd.concat(df_list)


Datatable_read.py

import datatable as dt

files = ["jan.csv", "feb.csv",
         "mar.csv", "apr.csv",
         "may.csv", "jun.csv"]
## 300 MBs each

df = dt.iread(files)
df = dt.rbind(df)
df = df.to_pandas()


Data is often split into multiple CSV files before it is handed to
the DS/ML team.


Pandas reads one file at a time on a single core, so one has to
iterate over the list of files, read them one by one, and concatenate
the results.


"Datatable" can provide a quick fix for this. Instead of reading the
files iteratively with Pandas, you can pass the whole list to
Datatable. Being parallelized, it provides a significant performance
boost as compared to Pandas.
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The performance gain is not just limited to I/O but is observed in many 
other tabular operations as well. 


Read more here: DataTable Docs. 
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Elegantly Plot the Decision Boundary 
of a Classifier 


from mlxtend.plotting import plot_decision_regions

model = LogisticRegression().fit(X, y)

plot_decision_regions(X, y, model)

[Figure: shaded decision regions of the fitted classifier; x-axis:
Feature 1.]


Plotting the decision boundary of a classifier can reveal many crucial
insights about its performance.


Here, region-shaded plots are often considered a suitable choice for
visualization purposes. But creating one by hand can be extremely
time-consuming and complicated.


Mlxtend condenses that to a simple one-liner in Python. You can plot
the decision boundary of a classifier with ease, just by passing the
data and the fitted model to plot_decision_regions().
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An Elegant Way to Import Metrics 
From Sklearn 


from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score,
                             roc_auc_score)

>>> accuracy_score(y_true, y_pred)
0.5

>>> precision_score(y_true, y_pred)
0.8


from sklearn.metrics import get_scorer

accuracy = get_scorer("accuracy")
>>> accuracy._score_func(y_true, y_pred)
0.5

precision = get_scorer("precision")
>>> precision._score_func(y_true, y_pred)
0.8


While using Scikit-learn, one often imports multiple metrics to
evaluate a model. Although there is nothing wrong with this practice,
it makes the code inelegant and cluttered, with the initial few lines
of the file overloaded with imports.


Instead of importing the metrics individually, you can use the
get_scorer() method. Here, you can pass the metric's name as a
string, and it returns a scorer object for you.
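
A short sketch on made-up labels. It shows the route used in the snippet, which reaches into the scorer's private _score_func attribute (an implementation detail that may change between releases), next to the public way of using a scorer, which is to call it as scorer(estimator, X, y). The DummyClassifier is only there to have a fitted estimator to score.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, get_scorer

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 1]

accuracy = get_scorer("accuracy")

# Route from the snippet: call the wrapped metric function directly.
# `_score_func` is private, so treat this as a convenience, not an API.
via_private = accuracy._score_func(y_true, y_pred)

# Public route: a scorer takes a fitted estimator plus data.
X = [[0], [1], [2], [3]]
clf = DummyClassifier(strategy="most_frequent").fit(X, y_true)
via_scorer = accuracy(clf, X, y_true)

print(via_private, via_scorer)
```

The public form is the one that tools such as cross-validation use internally when you pass scoring="accuracy".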


Read more here: Scikit-learn page. 
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Configure Sklearn To Output Pandas 
DataFrame 


Scikit-learn 1.1

from sklearn.preprocessing import StandardScaler

X_train = ... ## Pandas DataFrame

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

type(X_scaled) ## numpy.ndarray


Scikit-learn 1.2.dev

scaler = StandardScaler()
scaler.set_output(transform="pandas")
X_scaled = scaler.fit_transform(X_train)

type(X_scaled) ## pandas.core.frame.DataFrame


Recently, Scikit-learn announced the release of one of the most awaited
improvements. In short, sklearn can now be configured to output
Pandas DataFrames instead of NumPy arrays.


Until now, Sklearn's transformers accepted a Pandas DataFrame as
input but always returned a NumPy array as output. As a result, the
output had to be manually converted back to a Pandas DataFrame, and
the column names had to be restored by hand.


Now, the set_output API lets transformers output a Pandas DataFrame
instead.


This will make running pipelines on DataFrames smoother. Moreover,
it will provide better ways to track feature names.
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Display Progress Bar With Apply() in 
Pandas 


Without Progress

import pandas as pd

df.apply(func)


With Progress

import pandas as pd
from tqdm.notebook import tqdm

tqdm.pandas()

df.progress_apply(func)

100% |██████████| 1000000/1000000 [00:04<00:00, 281929.55it/s]


While applying a method to a DataFrame using apply(), we don't get
to see the progress or an estimate of the remaining time.


To resolve this, call tqdm.pandas() once and then use
progress_apply() in place of apply() to display a progress bar while
the method runs.


Read more here: GitHub. 
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Modify a Function During Run-time 


[Code screenshot, only partly legible: it imports from time and
reloading, defines a function func(num) whose body prints f-strings
built from {number}, and then calls func.]


Have you ever been in a situation where you wished to add more
details to code that is already running?


This is typically observed in ML, where one often forgets to print all
the essential training details/metrics. Executing the entire code
again, especially when it has been running for some time, is not an
ideal approach here.


If you want to modify a function during execution, decorate it with the
reloading decorator (@reloading). As a result, Python will reload the
function from the source before each execution.


Link to reloading: GitHub. 
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Regression Plot Made Easy with Plotly 


Regression Plot

import plotly.express as px

fig = px.scatter(...)


While creating scatter plots, one is often interested in displaying a
simple linear regression fit on the data points.


Here, training a model and manually embedding it in the plot can be
tedious.


Instead, with Plotly, you can add a regression line to a scatter plot
without explicitly training a model: px.scatter() accepts a trendline
argument (for example, trendline="ols") that fits and draws the line
for you.


Read more here. 
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Polynomial Linear Regression with 
NumPy 


Sklearn

## 1st Degree Polynomial
model = LinearRegression().fit(x, y)

## 2nd Degree Polynomial
x = np.hstack((x, x**2))
model = LinearRegression().fit(x, y)

inp = np.array([[x, x**2]])
model.predict(inp)


NumPy

coeff = np.polyfit(x, y, deg = 2)
model = np.poly1d(coeff)

>>> inp = ...
>>> model(inp)
-10.4


Polynomial linear regression with Sklearn's LinearRegression is
tedious as one has to explicitly build the polynomial features. This
can get challenging when one has to iteratively build higher-degree
polynomial models.


NumPy's polyfit() method is an excellent alternative to this. Here, you
can specify the degree of the polynomial as a parameter. As a result, it
automatically creates the corresponding polynomial features.


The downside is that you cannot add custom features such as
trigonometric/logarithmic terms. In other words, you are restricted to
polynomial features of a single input variable. But if that is all you
need, NumPy's polyfit() method can be a better approach.
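
A minimal sketch on made-up data generated from y = 2x^2 + 3x + 1, so the fitted coefficients can be checked by eye. polyfit() returns the coefficients with the highest power first, and poly1d() wraps them into a callable model.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 2 * x**2 + 3 * x + 1          # noise-free quadratic

coeff = np.polyfit(x, y, deg=2)   # approximately [2, 3, 1]
model = np.poly1d(coeff)

prediction = model(2.0)           # 2*4 + 3*2 + 1
print(coeff, prediction)
```

Changing the degree is a one-argument edit (deg=3, deg=4, ...), which is what makes iterating over model degrees convenient.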


Read more:
https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html.
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Alter the Datatype of Multiple 
Columns at Once 


col2  col3  col4
7     4     A
9     6     B
2     5     A


df["col1"] = df.col1.astype(np.int32)
df["col2"] = df.col2.astype(np.int16)
df["col3"] = df.col3.astype(np.float16)


df.astype({
    "col1": np.int32,
    "col2": np.int16,
    "col3": np.float16}
)


A common approach to alter the datatype of multiple columns is to
invoke the astype() method individually for each column.


Although the approach works as expected, it requires multiple function
calls and more code. This gets particularly tedious when you want to
modify the datatype of many columns.


As a better approach, you can condense all the conversions into a
single function call. This is achieved by passing a dictionary of
column-to-datatype mapping, as shown in the snippet.
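
A runnable sketch of the dictionary form on a made-up DataFrame. Columns that are not mentioned in the mapping (col4 here) keep their datatype. Note that astype() returns a new DataFrame, so the result is assigned back.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": [7, 9, 2],
    "col3": [4.0, 6.0, 5.0],
    "col4": ["A", "B", "A"],
})

# One call, one mapping from column name to target datatype.
df = df.astype({
    "col1": np.int32,
    "col2": np.int16,
    "col3": np.float16,
})

print(df.dtypes)
```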




Datatype For Handling Missing 
Valued Columns in Pandas 


NaN column

>>> len(df.col1)
## Total entries: 1,000,000

>>> len(df[df.col1.isna()])
## NaN entries: 700,000 (70%)


Sparse Datatype

df.col1.memory_usage()
## Memory usage before conversion: 7.6 MB

df["col1"] = df.col1.astype("Sparse[float32]")

df.col1.memory_usage()
## Memory usage after conversion: [value cut off in the source]


If a column consists mostly of NaN values, Pandas provides a datatype
well suited to it, called the Sparse datatype. It stores only the
non-missing entries and their positions instead of every value.


This is especially handy when you are working with large data-driven
projects with many missing values.


The snippet compares the memory usage of float and sparse datatype in
Pandas.
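
A smaller-scale sketch of the same comparison on made-up data (100,000 rows, roughly 70% NaN, generated with a fixed seed). The exact byte counts depend on the Pandas version, so only the direction of the change is worth relying on.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = rng.random(100_000)
values[rng.random(100_000) < 0.7] = np.nan   # roughly 70% missing

df = pd.DataFrame({"col1": values})
dense_bytes = df["col1"].memory_usage()

# Keep only the non-NaN values (as float32) plus their positions.
df["col1"] = df["col1"].astype("Sparse[float32]")
sparse_bytes = df["col1"].memory_usage()

print(dense_bytes, sparse_bytes)
```

The saving grows with the share of missing values; a column with few NaNs gains little from the conversion.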
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Parallelize Pandas with Pandarallel 


Pandarallel.py

from pandarallel import pandarallel
pandarallel.initialize()

def add_row(row):
    return sum(row)

df = ... ## 10M Rows, 2 Columns


Apply vs Parallel Apply

df.apply(add_row, axis = 1)
## 53 secs

df.parallel_apply(add_row, axis = 1)
## 11 secs


Pandas' operations do not support parallelization. As a result, they
run on a single core, even when other cores are available. This makes
Pandas slow on large datasets.


"Pandarallel" allows you to parallelize these operations across
multiple CPU cores by changing just one line of code. Supported
methods include apply(), applymap(), groupby(), map() and rolling().


Read more: GitHub. 
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Why you should not dump 
DataFrames to a CSV 


Save DF

df = ... ## 1M Rows, 30 Columns

df.to_csv("file.csv")

df.to_pickle("file.pickle")

df.to_parquet("file.parquet")


The CSV file format is widely used to save Pandas DataFrames. But
are you aware of its limitations? To name a few:


1. A CSV does not store datatype information. Thus, if you modify
the datatype of column(s), save the DataFrame to a CSV, and load it
again, Pandas will not return the same datatypes.


2. Saving the DataFrame to a CSV file format isn't as optimized as 
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other supported formats by Pandas. These include Parquet, Pickle, etc. 


Of course, if you need to view your data outside Python (Excel, for 
instance), you are bound to use a CSV. But if not, prefer other file 
formats. 
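
The datatype limitation can be seen with a small round trip. This is a sketch on a made-up two-column DataFrame, written to a temporary directory; Pickle is used as the comparison format because it needs no extra dependency (Parquet requires an additional engine such as pyarrow).

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "count": np.array([1, 2, 3], dtype=np.int8),
    "label": pd.Categorical(["A", "B", "A"]),
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "file.csv"
    pickle_path = Path(tmp) / "file.pickle"

    df.to_csv(csv_path, index=False)
    df.to_pickle(pickle_path)

    from_csv = pd.read_csv(csv_path)           # dtypes are re-inferred
    from_pickle = pd.read_pickle(pickle_path)  # dtypes are restored

print(from_csv.dtypes)
print(from_pickle.dtypes)
```

After the CSV round trip, the int8 column comes back as a 64-bit integer and the categorical column is no longer categorical, while the Pickle round trip preserves both.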


Further reading: Why I Stopped Dumping DataFrames to
a CSV and Why You Should Too.
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Save Memory with Python Generators 


Generator.py

List

from sys import getsizeof

my_list = [i for i in range(10...)]
## use [] to create a list


Generator

from sys import getsizeof

my_gen = (i for i in range(10...))
## use () to create a generator

getsizeof(my_gen)
119

sum(my_gen)

sum(my_gen)

If you use large static iterables in Python, a list may not be an optimal
choice, especially in memory-constrained applications.

A list stores the entire collection in memory. However, a generator
computes and loads a single element at a time ONLY when it is
required. This saves both memory and object creation time.


Of course, there are some limitations of generators too. For instance,
you cannot use common list operations such as append(), slicing, etc.


Moreover, a generator can be consumed only once, so every time you
want to reuse an element, it must be regenerated (see Generator.py:
line 12).
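
A small sketch of both points, the memory difference and the one-pass limitation. The size of the range is chosen here for illustration, and the byte counts reported by getsizeof() vary across Python versions.

```python
from sys import getsizeof

my_list = [i for i in range(10**6)]   # [] builds the whole list in memory
my_gen = (i for i in range(10**6))    # () builds a lazy generator

list_bytes = getsizeof(my_list)       # grows with the number of elements
gen_bytes = getsizeof(my_gen)         # small and independent of the range

first_sum = sum(my_gen)    # consumes the generator
second_sum = sum(my_gen)   # nothing left to yield, so the sum is 0

print(list_bytes, gen_bytes, first_sum, second_sum)
```

To iterate a second time, the generator expression has to be created again.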
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Don't use print() to debug your code. 


Print

def func(arr, n):
    print(...)

func(...)
## arr = [1, ...


Icecream

from icecream import ic

def func(arr, n):
    ic(arr, n)

func(...)


Debugging with print statements is a messy and inelegant approach. It 
is confusing to map the output to its corresponding debug statement. 
Moreover, it requires extra manual formatting to comprehend the 
output. 


The "icecream" library in Python is an excellent alternative to this. 
It makes debugging effortless and readable, with minimal code. 
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Features include printing expressions, variable names, function names, 
line numbers, filenames, and many more. 


P.S. The snippet only gives a brief demonstration. However, the actual 
functionalities are much more powerful and elegant as compared to 
debugging with print(). 


More about icecream here: https://github.com/gruns/icecream.
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Find Unused Python Code With Ease 


code.py

def sum_func(arr):
    return sum(arr)

def max_func(arr):
    return max(arr)

input_arr = ...
flag = ...

input_sum = sum_func(input_arr)
print(input_sum)


Terminal

$ vulture code.py
code.py:4: unused function 'max_func'
code.py:10: unused variable 'flag'


As the size of your codebase increases, so can the amount of unused
code. This hurts its readability and conciseness.


With the "vulture" module in Python, you can locate dead (unused)
code in your pipeline, as shown in the snippet.
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Define the Correct DataType for 
Categorical Columns 


Categorical Col.

import pandas as pd

len(df.Gender)
## 1500

df.Gender.unique()
## "Male", "Female"


Reduce Memory Usage

import pandas as pd

df.Gender.memory_usage(), df.Gender.dtype
## 10.5 KB, object

df["Gender"] = df.Gender.astype("category")

df.Gender.memory_usage(), df.Gender.dtype
## [size illegible in the source], CategoricalDtype


If your data has categorical columns, you should not represent them as
plain int/string data types.


Rather, Pandas provides an optimized data type specifically for
categorical columns. It stores each distinct value once and keeps
small integer codes for the rows. This is especially handy when you
are working with large data-driven projects.


The snippet compares the memory usage of string and categorical data
types in Pandas.
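
A runnable sketch of the same comparison on made-up data (1,500 rows with two distinct values, generated with a fixed seed). deep=True is used so that the memory held by the strings themselves is counted; the exact numbers depend on the Pandas version.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"Gender": rng.choice(["Male", "Female"], size=1500)})

before = df["Gender"].memory_usage(deep=True)   # one string per row

df["Gender"] = df["Gender"].astype("category")
after = df["Gender"].memory_usage(deep=True)    # integer codes + 2 labels

print(before, after, df["Gender"].dtype)
```

The fewer distinct values a column has relative to its length, the larger the saving.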




Transfer Variables Between Jupyter 
Notebooks 


Notebook 1

value = ...

%store value


Notebook 2

%store -r value

print(value)


While working with multiple Jupyter notebooks, you may need to share
objects between them.


With the "store" magic command, you can transfer variables across
notebooks without manually saving them to a file and loading them
again.


P.S. You can also restart the kernel and retrieve an old variable with
"store".
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Why You Should Not Read CSVs with 
Pandas 


Pandas

file = "file.csv"
## 1M rows and 30 columns

import pandas as pd

pd.read_csv(file)
## [timing illegible in the source]


Datatable

file = "file.csv"
## 1M rows and 30 columns

import datatable as dt

df = dt.fread(file)
df = df.to_pandas()
## 4.04 secs


Pandas runs on a single core, which can make its operations slow,
especially on large datasets.


The "datatable" library in Python is an excellent alternative with a
Pandas-like API. Its multi-threaded data processing support makes it
faster than Pandas.


The snippet demonstrates the run-time comparison of creating a
"Pandas DataFrame" from a CSV using Pandas and Datatable.




Modify Python Code During Run- 
Time 


Before the edit (script already running)

from time import sleep
from reloading import reloading

for number in reloading(...):
    if number % 2:
        print(f"{number} is Odd")
    else:
        pass


After the edit (same run, no restart)

from time import sleep
from reloading import reloading

for number in reloading(...):
    if number % 2:
        print(f"{number} is Odd")
    else:
        print(f"{number} is Even")


Have you ever been in a situation where you wished to add more
details to code that is already running (printing more details in a
for-loop, for instance)?


Executing the entire code again, especially when it has been running
for some time, is not the ideal approach here.


With the "reloading" library in Python, you can add more details to
running code without losing any existing progress.
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Handle Missing Data With Missingno 


Missing Data

import pandas as pd
import missingno as msno

df = pd.read_csv("data.csv")

df.isnull().sum()

msno.matrix(df)

[Figure: msno.matrix(df) output, one column per field. Legible field
names: BOROUGH, ZIP CODE, ON STREET NAME, CROSS STREET NAME, OFF
STREET NAME, NUMBER OF CYCLISTS INJURED, NUMBER OF CYCLISTS KILLED,
CONTRIBUTING FACTOR VEHICLE 2 to 5, VEHICLE TYPE CODE 1 to 5.]


If you want to analyze missing values in your dataset, Pandas alone
may not be the best choice.


Summaries such as df.isnull().sum() hide many important details about
missing values. These include their location, periodicity, the
correlation across columns, etc.


The "missingno" library in Python is an excellent resource for
exploring missing data. It generates informative visualizations for
improved data analysis.


The snippet demonstrates missing data analysis using Pandas and
Missingno.
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